Recognition: no theorem link
When simulations look right but causal effects go wrong: Large language models as behavioral simulators
Pith reviewed 2026-05-13 20:31 UTC · model grok-4.3
The pith
Large language models reproduce observed attitudes in simulations but fail to capture the true causal effects of interventions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs reproduced observed attitudinal patterns from human survey data on eleven climate-psychology interventions reasonably well, with prompting refinements improving descriptive fit, yet this match did not translate into accurate estimates of causal intervention effects; the two dimensions of accuracy displayed distinct error structures that varied by intervention logic, outcome type, and population, with larger causal errors for experience-evoking interventions and for behavioral measures.
What carries the argument
The separation between descriptive fit (how closely LLM outputs match observed attitude distributions) and causal fidelity (how closely LLM-estimated intervention effects match human experimental effects).
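To make the distinction concrete, here is a minimal sketch (toy numbers and variable names of our own, not the paper's data or code): small level errors of opposite sign can leave descriptive fit looking good while erasing most of the estimated effect.

```python
import numpy as np

# Toy group means on a 1-7 attitude scale for one hypothetical
# intervention; none of these numbers come from the paper.
human_control = np.array([3.5, 3.7, 3.9])   # mean 3.7
human_treated = np.array([4.0, 4.2, 4.4])   # mean 4.2
llm_control   = np.array([3.7, 3.9, 4.1])   # mean 3.9
llm_treated   = np.array([3.8, 4.0, 4.2])   # mean 4.0

# Descriptive fit: mean absolute error of simulated vs. observed
# attitude levels. Both groups are off by only 0.2 on a 7-point
# scale, so the simulation "looks right".
descriptive_error = np.mean([
    abs(llm_control.mean() - human_control.mean()),
    abs(llm_treated.mean() - human_treated.mean()),
])  # 0.2

# Causal fidelity: error in the estimated intervention effect.
# Opposite-signed level errors compound: the true effect is 0.5,
# the simulated effect is 0.1, so most of the effect is lost.
human_effect = human_treated.mean() - human_control.mean()  # 0.5
llm_effect = llm_treated.mean() - llm_control.mean()        # 0.1
causal_error = abs(llm_effect - human_effect)               # 0.4

print(f"descriptive error: {descriptive_error:.2f}")
print(f"causal error:      {causal_error:.2f}")
```

The toy numbers make the point: both simulated group means sit within 0.2 of the human means on a 7-point scale, yet the simulated effect recovers only a fifth of the true one.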
If this is right
- Descriptive accuracy alone is insufficient to validate simulation-based forecasts of intervention impacts.
- Errors grow for interventions that require evoking internal experience rather than conveying explicit reasons or social norms.
- LLMs produce stronger attitude-behavior associations than appear in human data, inflating predicted behavioral change (illustrated in the sketch after this list).
- Populations that appear well captured in attitude distributions can still yield large causal mis-estimates.
- Relying solely on descriptive checks risks both incorrect policy conclusions and undetected disparities across groups.
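On the attitude-behavior coupling bullet above, a hedged illustration with synthetic data; the coupling strengths of 0.3 and 0.9 are chosen to mimic the qualitative pattern, not estimated from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Standardized attitudes for a synthetic population.
attitude = rng.normal(size=n)

# Human behavior: weakly coupled to attitude (r ~ 0.3), reflecting
# the well-known attitude-behavior gap in human data.
human_behavior = 0.3 * attitude + rng.normal(scale=0.95, size=n)

# Simulated behavior: over-tightly coupled (r ~ 0.9), the pattern
# the paper attributes to LLM simulations.
llm_behavior = 0.9 * attitude + rng.normal(scale=0.44, size=n)

r_human = np.corrcoef(attitude, human_behavior)[0, 1]
r_llm = np.corrcoef(attitude, llm_behavior)[0, 1]
print(f"coupling: human r = {r_human:.2f}, LLM r = {r_llm:.2f}")

# Consequence: the same intervention-induced attitude shift implies
# a threefold larger predicted behavior change under the LLM coupling.
shift = 0.2
print(f"predicted behavior change: human {0.3 * shift:.2f}, LLM {0.9 * shift:.2f}")
```

Under the tighter coupling, an identical attitude shift translates into a threefold larger predicted behavior change, which is exactly the inflation mechanism the bullet describes.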
Where Pith is reading between the lines
- Separate causal-validation protocols, using randomized human trials on new interventions, would be needed before treating LLM outputs as policy guidance.
- The same descriptive-causal split may appear in other simulation domains such as health behavior or consumer choice, suggesting a general caution for LLM-based forecasting.
- Training objectives that explicitly reward accurate effect-size recovery, rather than only next-token prediction or attitude matching, could reduce the gap.
- Masked causal errors could lead to policy recommendations that appear equitable in aggregate but widen outcome gaps for under-represented subgroups.
Load-bearing premise
The human survey responses supply an unbiased ground-truth measure of real causal effects, and LLM prompts can encode intervention contexts and population traits without adding systematic distortions.
What would settle it
Finding that the magnitude of causal error on held-out interventions is strongly predicted by the magnitude of descriptive error across the same interventions would contradict the reported divergence.
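One way to operationalize that test, sketched below with placeholder per-intervention error vectors (neither the numbers nor the 0.7 cutoff come from the paper):

```python
import numpy as np
from scipy import stats

def divergence_contradicted(descriptive_error, causal_error, rho_cutoff=0.7):
    """Rank-correlate per-intervention descriptive and causal errors.

    A strong, significant positive correlation would mean descriptive
    error predicts causal error across interventions, contradicting the
    reported descriptive-causal divergence. The cutoff is hypothetical.
    """
    rho, p = stats.spearmanr(descriptive_error, causal_error)
    return rho, p, (rho > rho_cutoff and p < 0.05)

# Illustrative errors for 11 interventions (made up): descriptive
# errors are uniformly modest while causal errors vary widely, so the
# correlation is weak and the reported divergence would stand.
desc = np.array([0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.14, 0.13, 0.10, 0.12, 0.11])
caus = np.array([0.05, 0.40, 0.10, 0.08, 0.35, 0.12, 0.45, 0.09, 0.30, 0.11, 0.38])

rho, p, contradicted = divergence_contradicted(desc, caus)
print(f"rho = {rho:.2f}, p = {p:.2f}, contradicted = {contradicted}")
```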
Original abstract
Behavioral simulation is increasingly used to anticipate responses to interventions. Large language models (LLMs) enable researchers to specify population characteristics and intervention context in natural language, but it remains unclear to what extent LLMs can use these inputs to infer intervention effects. We evaluated three LLMs on 11 climate-psychology interventions using a dataset of 59,508 participants from 62 countries, and replicated the main analysis in two additional datasets (12 and 27 countries). LLMs reproduced observed patterns in attitudinal outcomes (e.g., climate beliefs and policy support) reasonably well, and prompting refinements improved this descriptive fit. However, descriptive fit did not reliably translate into causal fidelity (i.e., accurate estimates of intervention effects), and these two dimensions of accuracy followed different error structures. This descriptive-causal divergence held across the three datasets, but varied across intervention logics, with larger errors for interventions that depended on evoking internal experience than on directly conveying reasons or social cues. It was more pronounced for behavioral outcomes, where LLMs imposed stronger attitude-behavior coupling than in human data. Countries and population groups appearing well captured descriptively were not necessarily those with lower causal errors. Relying on descriptive fit alone may therefore create unwarranted confidence in simulation results, misleading conclusions about intervention effects and masking population disparities that matter for fairness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates three LLMs on their ability to simulate human responses to 11 climate-psychology interventions, using a primary dataset of 59,508 participants from 62 countries plus replications in two smaller multi-country datasets. It reports that LLMs achieve reasonable descriptive fit to observed attitudinal patterns (e.g., beliefs and policy support) that improves with prompting refinements, but that this fit does not reliably extend to accurate recovery of causal intervention effects; the two accuracy dimensions exhibit distinct error structures, with larger causal errors for internal-experience interventions and behavioral outcomes where LLMs impose stronger attitude-behavior coupling than observed in humans. The descriptive-causal divergence persists across datasets, and descriptively well-captured countries/groups do not necessarily show lower causal error.
Significance. If the central descriptive-causal divergence finding holds after addressing benchmark validity, the result is significant for the growing use of LLMs as behavioral simulators in policy-oriented fields. It provides concrete evidence that descriptive matching alone can produce misleading inferences about intervention impacts and population disparities, with direct implications for fairness and decision-making in climate psychology and related domains. The multi-dataset replication and differentiation by intervention logic strengthen the contribution relative to purely descriptive LLM evaluations.
major comments (2)
- [§2.1 and §4.3] The central claim that descriptive fit fails to translate into causal fidelity treats the survey responses as accurate ground-truth causal effects. However, the paper does not discuss or test for hypothetical bias inherent in vignette-based stated-preference designs, which is known to inflate effects for behavioral outcomes via social-desirability responding. This is load-bearing because the reported divergence (and the claim that it is not an artifact) depends on the human benchmark being unbiased; if the survey overestimates real-world causal impacts, the LLM error patterns may partly reflect benchmark mismatch rather than simulation failure.
- [§3, Prompting and intervention representation] Exact prompt templates, how intervention contexts and population characteristics are encoded in natural language, and the precise definition of 'causal fidelity' metrics (e.g., how intervention effects are computed from LLM outputs versus human data) are not fully specified. This is load-bearing for the causal-fidelity results because small changes in prompt framing can alter inferred effects, and without these details the claim that prompting refinements improve descriptive but not causal accuracy cannot be independently verified or stress-tested.
minor comments (2)
- [Figure 3 and associated text] The visualization of error structures across intervention logics would benefit from explicit error bars or confidence intervals on the causal-error differences to allow readers to assess whether the reported variation by intervention type is statistically reliable.
- [Introduction and Discussion] The paper cites prior work on LLM behavioral simulation but could add references to the literature on hypothetical bias in climate-psychology surveys (e.g., studies on stated vs. revealed preferences) to contextualize the ground-truth assumption.
Simulated Author's Rebuttal
We thank the referee for these constructive comments, which highlight important issues for the interpretation and reproducibility of our results. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: The central claim that descriptive fit fails to translate into causal fidelity treats the survey responses as accurate ground-truth causal effects. However, the paper does not discuss or test for hypothetical bias inherent in vignette-based stated-preference designs, which is known to inflate effects for behavioral outcomes via social-desirability responding. This is load-bearing because the reported divergence depends on the human benchmark being unbiased.
Authors: We agree that hypothetical bias is a relevant limitation of vignette-based stated-preference data and should be explicitly discussed. In the revision we will add a paragraph in §2.1 acknowledging this issue, citing relevant literature on social-desirability effects in climate surveys, and clarifying that our claims are relative to the observed survey benchmark rather than to unobserved real-world behavior. At the same time, the core descriptive-causal divergence finding remains informative even under benchmark bias: any systematic inflation in the human data would affect the reference standard uniformly, yet LLMs still deviate from it in ways that differ systematically from their descriptive errors. We will also note that testing against real behavioral outcomes lies beyond the current scope but represents a valuable direction for future work. revision: yes
- Referee: Exact prompt templates, how intervention contexts and population characteristics are encoded in natural language, and the precise definition of 'causal fidelity' metrics (e.g., how intervention effects are computed from LLM outputs versus human data) are not fully specified. This is load-bearing for the causal-fidelity results because small changes in prompt framing can alter inferred effects.
Authors: We accept that full specification of prompts and metrics is necessary for independent verification. In the revised manuscript we will add a new appendix containing the complete prompt templates for all three LLMs, including the exact phrasing used to encode intervention contexts, population characteristics, and outcome measures. We will also expand §3 to include the precise formulas and code-level definitions for both descriptive fit (e.g., mean absolute error on attitudinal items) and causal fidelity (difference-in-differences computed from LLM-generated responses versus human data), ensuring every step is reproducible. revision: yes
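Pending that appendix, here is a minimal sketch of the two metrics as the rebuttal describes them; the prompt wording and every function and variable name here are hypothetical stand-ins, not the authors' materials.

```python
import numpy as np

# Hypothetical prompt template in the spirit the rebuttal describes;
# the actual templates will appear in the promised appendix.
PROMPT = (
    "You are a survey respondent from {country}, age {age}, "
    "political orientation {politics}.\n"
    "{intervention_text}\n"
    "On a scale from 1 (not at all) to 7 (very much), how much do you "
    "support climate policies? Answer with a single number."
)

def descriptive_fit(llm_item_means, human_item_means):
    """Descriptive fit: mean absolute error between simulated and
    observed means across attitudinal items."""
    return float(np.mean(np.abs(np.asarray(llm_item_means) -
                                np.asarray(human_item_means))))

def causal_error(llm_treated, llm_control, human_treated, human_control):
    """Causal fidelity error: absolute gap between the LLM-estimated and
    human-estimated intervention effects (a difference-in-differences-style
    contrast of treated-minus-control means)."""
    llm_effect = np.mean(llm_treated) - np.mean(llm_control)
    human_effect = np.mean(human_treated) - np.mean(human_control)
    return float(abs(llm_effect - human_effect))

# Toy usage: good descriptive fit (0.2 on a 7-point scale) alongside
# a large effect error (0.4 against a true effect of 0.5).
print(descriptive_fit([3.9, 4.0], [3.7, 4.2]))   # 0.2
print(causal_error([4.0], [3.9], [4.2], [3.7]))  # 0.4
```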
Circularity Check
No circularity: direct empirical comparison of LLM outputs to human survey data
Full rationale
The paper performs an empirical evaluation by running LLMs on 11 climate-psychology interventions and comparing outputs to observed patterns in three human survey datasets (59,508 participants plus replications). No mathematical derivations, parameter fitting, or model-generated predictions enter the benchmark. The central claim, that descriptive fit does not reliably imply causal fidelity, rests on direct statistical comparisons of error structures across attitudinal and behavioral outcomes. No self-citations are load-bearing; the analysis is self-contained against external human benchmarks, and nothing in its construction reduces the results to the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs prompted with natural-language descriptions can simulate population-level responses to interventions
Reference graph
Works this paper leans on
- [1] E. P. Fenichel, C. Castillo-Chavez, M. G. Ceddia, et al. Adaptive human behavior in epidemiological models. Proceedings of the National Academy of Sciences, 108(15):6306–6311, 2011.
- [2] Kai Ruggeri, Friederike Stock, S. Alexander Haslam, et al. A synthesis of evidence for policy from behavioural science during COVID-19. Nature, 625(7993):134–147, 2024.
- [3] Jens Hainmueller, Dominik Hangartner, and Teppei Yamamoto. Validating vignette and conjoint survey experiments against real-world behavior. Proceedings of the National Academy of Sciences, 112(8):2395–2400, 2015.
- [4] Patrik Michaelsen, Aksel Sundström, and Sverker C. Jagers. Mass support for conserving 30% of the earth by 2030: Experimental evidence from five continents. Proceedings of the National Academy of Sciences, 122(35):e2503355122, 2025.
- [5] E. Bruch and J. Atwell. Agent-based models in empirical social research. Sociological Methods & Research, 44(2):186–221, 2015.
- [6] Marco Pangallo, Alberto Aleta, R. Maria del Rio-Chanona, et al. The unequal effects of the health–economy trade-off during the COVID-19 pandemic. Nature Human Behaviour, 8(2):264–275, 2024.
- [7] A. Sorgente, R. Caliciuri, M. Robba, M. Lanz, and B. D. Zumbo. A systematic review of latent class analysis in psychology: Examining the gap between guidelines and research practice. Behavior Research Methods, 57(11):301, 2025.
- [8] Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, et al. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
- [9] James Bisbee, Joshua D. Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M. Larson. Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4):401–416, 2024.
- [10] Yong Cao, Haijiang Liu, Arnav Arora, et al. Specializing large language models to simulate survey response distributions for global populations. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3141–3154. Association for Computational Linguistics, 2025.
- [11] Akaash Kolluri, Shengguang Wu, Joon Sung Park, and Michael S. Bernstein. Finetuning LLMs for human behavior prediction in social science experiments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30084–30099, 2025.
- [12] Christopher A. Bail. Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024.
- [13] Pujen Shrestha, Dario Krpan, Fatima Koaik, et al. Beyond WEIRD: Can synthetic survey participants substitute for humans in global policy research? Behavioral Science & Policy, 10(2):26–45, 2024.
- [14] Xiaoliang Luo, Akilles Rechardt, Guangzhi Sun, et al. Large language models surpass human experts in predicting neuroscience results. Nature Human Behaviour, 9(2):305–315, 2025.
- [15] Lisa P. Argyle, Ethan C. Busby, Joshua R. Gubler, et al. Testing theories of political persuasion using AI. Proceedings of the National Academy of Sciences, 122(18):e2412815122, 2025.
- [16] Chen Gao, Xiaochong Lan, Nian Li, et al. Large language models empowered agent-based modeling and simulation: a survey and perspectives. Humanities and Social Sciences Communications, 11(1):1259, 2024.
- [17] Carolin Kaiser, Jakob Kaiser, Vladimir Manewitsch, Lea Rau, and Rene Schallner. Simulating human opinions with large language models: Opportunities and challenges for personalized survey data modeling, 2025.
- [18] Pat Pataranutaporn, Nattavudh Powdthavee, Chayapatr Archiwaranguprok, and Pattie Maes. Simulating human well-being with large language models: Systematic validation and misestimation across 64,000 individuals from 64 countries. Proceedings of the National Academy of Sciences, 122(48):e2519394122, 2025.
- [19] Marcel Binz, Elif Akata, Matthias Bethge, et al. A foundation model to predict and capture human cognition. Nature, 644(8078):1002–1009, 2025.
- [20] Gati V. Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. PMLR, 2023.
- [21] Ziyan Cui, Ning Li, and Huaikang Zhou. A large-scale replication of scenario-based experiments in psychology and management using large language models. Nature Computational Science, 5(8):627–634, 2025.
- [22] Marcelo Sartori Locatelli, Pedro Dutenhefner, Arthur Buzelin, et al. AI and climate change discourse: What opinions do large language models present? In Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025), pages 113–125. Association for Computational Linguistics, 2025.
- [23] Jinghua Piao, Yuwei Yan, Jun Zhang, et al. AgentSociety: Large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691, 2025.
- [24] Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. Predicting results of social science experiments using large language models. Preprint, 2024.
- [25] Yuan Gao, Dokyun Lee, Gordon Burtch, and Sina Fazelpour. Take caution in using LLMs as human surrogates. Proceedings of the National Academy of Sciences, 122(24):e2501660122, 2025.
- [26] Olivier Toubia, George Z. Gui, Tianyi Peng, et al. Database report: Twin-2K-500: A data set for building digital twins of over 2,000 people based on their answers to over 500 questions. Marketing Science, 44(6):1446–1455, 2025.
- [27] Jessica Hullman, David Broska, Huaman Sun, and Aaron Shaw. This human study did not involve human subjects: Validating LLM simulations as behavioral evidence. arXiv preprint arXiv:2602.15785, 2026.
- [28] Kimberly C. Doell, Boryana Todorova, Madalina Vlasceanu, et al. The international climate psychology collaboration: Climate change-related data collected from 63 countries. Scientific Data, 11(1):1066, 2024.
- [29] Madalina Vlasceanu, Kimberly C. Doell, Joseph B. Bak-Coleman, et al. Addressing climate change with behavioral science: A global intervention tournament in 63 countries. Science Advances, 10(6):eadj5778, 2024.
- [30] Tobia Spampatti, Ulf J. J. Hahnel, Evelina Trutnevyte, and Tobias Brosch. Psychological inoculation strategies to fight climate disinformation across 12 countries. Nature Human Behaviour, 8(2):380–398, 2024.
- [31] Bojana Većkalov, Sandra J. Geiger, František Bartoš, et al. A 27-country test of communicating the scientific consensus on climate change. Nature Human Behaviour, 8(10):1892–1905, 2024.
- [32] George Gui and Olivier Toubia. The challenge of using LLMs to simulate human behavior: A causal inference perspective. arXiv preprint arXiv:2312.15524, 2023.
- [33] V. A. Shaffer, E. S. Focella, A. Hathaway, L. D. Scherer, and B. J. Zikmund-Fisher. On the usefulness of narratives: An interdisciplinary review and theoretical model. Annals of Behavioral Medicine, 52(5):429–442, 2018.
- [34]
- [35] Tiancheng Hu, Yara Kyrychenko, Steve Rathje, et al. Generative language models exhibit social identity biases. Nature Computational Science, 5(1):65–75, 2025.
- [36] Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, et al. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179, 2024.
- [37] Suhaib Abdurahman, Alireza Salkhordeh Ziabari, Alexander K. Moore, Daniel M. Bartels, and Morteza Dehghani. A primer for evaluating large language models in social-science research. Advances in Methods and Practices in Psychological Science, 8(2):25152459251325174, 2025.
- [38] Florian Lange and Siegfried Dewitte. The work for environmental protection task: A consequential web-based procedure for studying pro-environmental behavior. Behavior Research Methods, 54(1):133–145, 2022.
- [39] Paul C. Stern, Thomas Dietz, Troy Abel, Gregory A. Guagnano, and Linda Kalof. A value-belief-norm theory of support for social movements: The case of environmentalism. Human Ecology Review, 6(2):81–97, 1999.