Recognition: no theorem link
When simulations look right but causal effects go wrong: Large language models as behavioral simulators
Pith reviewed 2026-05-13 20:31 UTC · model grok-4.3
The pith
Large language models reproduce observed attitudes in simulations but fail to capture the true causal effects of interventions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs reproduced observed attitudinal patterns from human survey data on eleven climate-psychology interventions reasonably well, with prompting refinements improving descriptive fit, yet this match did not translate into accurate estimates of causal intervention effects; the two dimensions of accuracy displayed distinct error structures that varied by intervention logic, outcome type, and population, with larger causal errors for experience-evoking interventions and for behavioral measures.
What carries the argument
The separation between descriptive fit (how closely LLM outputs match observed attitude distributions) and causal fidelity (how closely LLM-estimated intervention effects match human experimental effects).
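To make the distinction concrete, here is a minimal sketch (toy numbers and variable names of our own, not the paper's data or code): small level errors of opposite sign can leave descriptive fit looking good while erasing most of the estimated effect.

```python
import numpy as np

# Toy group means on a 1-7 attitude scale for one hypothetical
# intervention; none of these numbers come from the paper.
human_control = np.array([3.5, 3.7, 3.9])   # mean 3.7
human_treated = np.array([4.0, 4.2, 4.4])   # mean 4.2
llm_control   = np.array([3.7, 3.9, 4.1])   # mean 3.9
llm_treated   = np.array([3.8, 4.0, 4.2])   # mean 4.0

# Descriptive fit: mean absolute error of simulated vs. observed
# attitude levels. Both groups are off by only 0.2 on a 7-point
# scale, so the simulation "looks right".
descriptive_error = np.mean([
    abs(llm_control.mean() - human_control.mean()),
    abs(llm_treated.mean() - human_treated.mean()),
])  # 0.2

# Causal fidelity: error in the estimated intervention effect.
# Opposite-signed level errors compound: the true effect is 0.5,
# the simulated effect is 0.1, so most of the effect is lost.
human_effect = human_treated.mean() - human_control.mean()  # 0.5
llm_effect = llm_treated.mean() - llm_control.mean()        # 0.1
causal_error = abs(llm_effect - human_effect)               # 0.4

print(f"descriptive error: {descriptive_error:.2f}")
print(f"causal error:      {causal_error:.2f}")
```

The toy numbers make the point: both simulated group means sit within 0.2 of the human means on a 7-point scale, yet the simulated effect recovers only a fifth of the true one.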
If this is right
- Descriptive accuracy alone is insufficient to validate simulation-based forecasts of intervention impacts.
- Errors grow for interventions that require evoking internal experience rather than conveying explicit reasons or social norms.
- LLMs produce stronger attitude-behavior associations than appear in human data, inflating predicted behavioral change (illustrated in the sketch after this list).
- Populations that appear well captured in attitude distributions can still yield large causal mis-estimates.
- Relying solely on descriptive checks risks both incorrect policy conclusions and undetected disparities across groups.
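On the attitude-behavior coupling bullet above, a hedged illustration with synthetic data; the coupling strengths of 0.3 and 0.9 are chosen to mimic the qualitative pattern, not estimated from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Standardized attitudes for a synthetic population.
attitude = rng.normal(size=n)

# Human behavior: weakly coupled to attitude (r ~ 0.3), reflecting
# the well-known attitude-behavior gap in human data.
human_behavior = 0.3 * attitude + rng.normal(scale=0.95, size=n)

# Simulated behavior: over-tightly coupled (r ~ 0.9), the pattern
# the paper attributes to LLM simulations.
llm_behavior = 0.9 * attitude + rng.normal(scale=0.44, size=n)

r_human = np.corrcoef(attitude, human_behavior)[0, 1]
r_llm = np.corrcoef(attitude, llm_behavior)[0, 1]
print(f"coupling: human r = {r_human:.2f}, LLM r = {r_llm:.2f}")

# Consequence: the same intervention-induced attitude shift implies
# a threefold larger predicted behavior change under the LLM coupling.
shift = 0.2
print(f"predicted behavior change: human {0.3 * shift:.2f}, LLM {0.9 * shift:.2f}")
```

Under the tighter coupling, an identical attitude shift translates into a threefold larger predicted behavior change, which is exactly the inflation mechanism the bullet describes.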
Where Pith is reading between the lines
- Separate causal-validation protocols, using randomized human trials on new interventions, would be needed before treating LLM outputs as policy guidance.
- The same descriptive-causal split may appear in other simulation domains such as health behavior or consumer choice, suggesting a general caution for LLM-based forecasting.
- Training objectives that explicitly reward accurate effect-size recovery, rather than only next-token prediction or attitude matching, could reduce the gap.
- Masked causal errors could lead to policy recommendations that appear equitable in aggregate but widen outcome gaps for under-represented subgroups.
Load-bearing premise
The human survey responses supply an unbiased ground-truth measure of real causal effects, and LLM prompts can encode intervention contexts and population traits without adding systematic distortions.
What would settle it
Finding that the magnitude of causal error on held-out interventions is strongly predicted by the magnitude of descriptive error across the same interventions would contradict the reported divergence.
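One way to operationalize that test, sketched below with placeholder per-intervention error vectors (neither the numbers nor the 0.7 cutoff come from the paper):

```python
import numpy as np
from scipy import stats

def divergence_contradicted(descriptive_error, causal_error, rho_cutoff=0.7):
    """Rank-correlate per-intervention descriptive and causal errors.

    A strong, significant positive correlation would mean descriptive
    error predicts causal error across interventions, contradicting the
    reported descriptive-causal divergence. The cutoff is hypothetical.
    """
    rho, p = stats.spearmanr(descriptive_error, causal_error)
    return rho, p, (rho > rho_cutoff and p < 0.05)

# Illustrative errors for 11 interventions (made up): descriptive
# errors are uniformly modest while causal errors vary widely, so the
# correlation is weak and the reported divergence would stand.
desc = np.array([0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.14, 0.13, 0.10, 0.12, 0.11])
caus = np.array([0.05, 0.40, 0.10, 0.08, 0.35, 0.12, 0.45, 0.09, 0.30, 0.11, 0.38])

rho, p, contradicted = divergence_contradicted(desc, caus)
print(f"rho = {rho:.2f}, p = {p:.2f}, contradicted = {contradicted}")
```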
Original abstract
Behavioral simulation is increasingly used to anticipate responses to interventions. Large language models (LLMs) enable researchers to specify population characteristics and intervention context in natural language, but it remains unclear to what extent LLMs can use these inputs to infer intervention effects. We evaluated three LLMs on 11 climate-psychology interventions using a dataset of 59,508 participants from 62 countries, and replicated the main analysis in two additional datasets (12 and 27 countries). LLMs reproduced observed patterns in attitudinal outcomes (e.g., climate beliefs and policy support) reasonably well, and prompting refinements improved this descriptive fit. However, descriptive fit did not reliably translate into causal fidelity (i.e., accurate estimates of intervention effects), and these two dimensions of accuracy followed different error structures. This descriptive-causal divergence held across the three datasets, but varied across intervention logics, with larger errors for interventions that depended on evoking internal experience than on directly conveying reasons or social cues. It was more pronounced for behavioral outcomes, where LLMs imposed stronger attitude-behavior coupling than in human data. Countries and population groups appearing well captured descriptively were not necessarily those with lower causal errors. Relying on descriptive fit alone may therefore create unwarranted confidence in simulation results, misleading conclusions about intervention effects and masking population disparities that matter for fairness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates three LLMs on their ability to simulate human responses to 11 climate-psychology interventions, using a primary dataset of 59,508 participants from 62 countries plus replications in two smaller multi-country datasets. It reports that LLMs achieve reasonable descriptive fit to observed attitudinal patterns (e.g., beliefs and policy support) that improves with prompting refinements, but that this fit does not reliably extend to accurate recovery of causal intervention effects; the two accuracy dimensions exhibit distinct error structures, with larger causal errors for internal-experience interventions and behavioral outcomes where LLMs impose stronger attitude-behavior coupling than observed in humans. The descriptive-causal divergence persists across datasets, and descriptively well-captured countries/groups do not necessarily show lower causal error.
Significance. If the central descriptive-causal divergence finding holds after addressing benchmark validity, the result is significant for the growing use of LLMs as behavioral simulators in policy-oriented fields. It provides concrete evidence that descriptive matching alone can produce misleading inferences about intervention impacts and population disparities, with direct implications for fairness and decision-making in climate psychology and related domains. The multi-dataset replication and differentiation by intervention logic strengthen the contribution relative to purely descriptive LLM evaluations.
major comments (2)
- [§2.1 and §4.3] The central claim that descriptive fit fails to translate into causal fidelity treats the survey responses as accurate ground-truth causal effects. However, the paper does not discuss or test for hypothetical bias inherent in vignette-based stated-preference designs, which is known to inflate effects for behavioral outcomes via social-desirability responding. This is load-bearing because the reported divergence (and the claim that it is not an artifact) depends on the human benchmark being unbiased; if the survey overestimates real-world causal impacts, the LLM error patterns may partly reflect benchmark mismatch rather than simulation failure.
- [§3, Prompting and intervention representation] Exact prompt templates, how intervention contexts and population characteristics are encoded in natural language, and the precise definition of 'causal fidelity' metrics (e.g., how intervention effects are computed from LLM outputs versus human data) are not fully specified. This is load-bearing for the causal-fidelity results because small changes in prompt framing can alter inferred effects, and without these details the claim that prompting refinements improve descriptive but not causal accuracy cannot be independently verified or stress-tested.
minor comments (2)
- [Figure 3 and associated text] The visualization of error structures across intervention logics would benefit from explicit error bars or confidence intervals on the causal-error differences to allow readers to assess whether the reported variation by intervention type is statistically reliable.
- [Introduction and Discussion] The paper cites prior work on LLM behavioral simulation but could add references to the literature on hypothetical bias in climate-psychology surveys (e.g., studies on stated vs. revealed preferences) to contextualize the ground-truth assumption.
Simulated Author's Rebuttal
We thank the referee for these constructive comments, which highlight important issues for the interpretation and reproducibility of our results. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: The central claim that descriptive fit fails to translate into causal fidelity treats the survey responses as accurate ground-truth causal effects. However, the paper does not discuss or test for hypothetical bias inherent in vignette-based stated-preference designs, which is known to inflate effects for behavioral outcomes via social-desirability responding. This is load-bearing because the reported divergence depends on the human benchmark being unbiased.
Authors: We agree that hypothetical bias is a relevant limitation of vignette-based stated-preference data and should be explicitly discussed. In the revision we will add a paragraph in §2.1 acknowledging this issue, citing relevant literature on social-desirability effects in climate surveys, and clarifying that our claims are relative to the observed survey benchmark rather than to unobserved real-world behavior. At the same time, the core descriptive-causal divergence finding remains informative even under benchmark bias: any systematic inflation in the human data would affect the reference standard uniformly, yet LLMs still deviate from it in ways that differ systematically from their descriptive errors. We will also note that testing against real behavioral outcomes lies beyond the current scope but represents a valuable direction for future work. revision: yes
- Referee: Exact prompt templates, how intervention contexts and population characteristics are encoded in natural language, and the precise definition of 'causal fidelity' metrics (e.g., how intervention effects are computed from LLM outputs versus human data) are not fully specified. This is load-bearing for the causal-fidelity results because small changes in prompt framing can alter inferred effects.
Authors: We accept that full specification of prompts and metrics is necessary for independent verification. In the revised manuscript we will add a new appendix containing the complete prompt templates for all three LLMs, including the exact phrasing used to encode intervention contexts, population characteristics, and outcome measures. We will also expand §3 to include the precise formulas and code-level definitions for both descriptive fit (e.g., mean absolute error on attitudinal items) and causal fidelity (difference-in-differences computed from LLM-generated responses versus human data), ensuring every step is reproducible. revision: yes
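Pending that appendix, here is a minimal sketch of the two metrics as the rebuttal describes them; the prompt wording and every function and variable name here are hypothetical stand-ins, not the authors' materials.

```python
import numpy as np

# Hypothetical prompt template in the spirit the rebuttal describes;
# the actual templates will appear in the promised appendix.
PROMPT = (
    "You are a survey respondent from {country}, age {age}, "
    "political orientation {politics}.\n"
    "{intervention_text}\n"
    "On a scale from 1 (not at all) to 7 (very much), how much do you "
    "support climate policies? Answer with a single number."
)

def descriptive_fit(llm_item_means, human_item_means):
    """Descriptive fit: mean absolute error between simulated and
    observed means across attitudinal items."""
    return float(np.mean(np.abs(np.asarray(llm_item_means) -
                                np.asarray(human_item_means))))

def causal_error(llm_treated, llm_control, human_treated, human_control):
    """Causal fidelity error: absolute gap between the LLM-estimated and
    human-estimated intervention effects (a difference-in-differences-style
    contrast of treated-minus-control means)."""
    llm_effect = np.mean(llm_treated) - np.mean(llm_control)
    human_effect = np.mean(human_treated) - np.mean(human_control)
    return float(abs(llm_effect - human_effect))

# Toy usage: good descriptive fit (0.2 on a 7-point scale) alongside
# a large effect error (0.4 against a true effect of 0.5).
print(descriptive_fit([3.9, 4.0], [3.7, 4.2]))   # 0.2
print(causal_error([4.0], [3.9], [4.2], [3.7]))  # 0.4
```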
Circularity Check
No circularity: direct empirical comparison of LLM outputs to human survey data
Full rationale
The paper performs an empirical evaluation by running LLMs on 11 climate-psychology interventions and comparing outputs to observed patterns in three human survey datasets (59,508 participants plus replications). No mathematical derivations, parameter fitting, or model-generated predictions enter the benchmark. The central claim, that descriptive fit does not reliably imply causal fidelity, rests on direct statistical comparisons of error structures across attitudinal and behavioral outcomes. No self-citations are load-bearing; the analysis is self-contained against external human benchmarks, and nothing in its construction reduces the results to the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs prompted with natural-language descriptions can simulate population-level responses to interventions
Reference graph
Works this paper leans on
- [1] E. P. Fenichel, C. Castillo-Chavez, M. G. Ceddia, et al. Adaptive human behavior in epidemiological models. Proceedings of the National Academy of Sciences, 108(15):6306–6311, 2011.
- [2] Kai Ruggeri, Friederike Stock, S. Alexander Haslam, et al. A synthesis of evidence for policy from behavioural science during COVID-19. Nature, 625(7993):134–147, 2024.
- [3] Jens Hainmueller, Dominik Hangartner, and Teppei Yamamoto. Validating vignette and conjoint survey experiments against real-world behavior. Proceedings of the National Academy of Sciences, 112(8):2395–2400, 2015.
- [4] Patrik Michaelsen, Aksel Sundström, and Sverker C. Jagers. Mass support for conserving 30% of the earth by 2030: Experimental evidence from five continents. Proceedings of the National Academy of Sciences, 122(35):e2503355122, 2025.
- [5] E. Bruch and J. Atwell. Agent-based models in empirical social research. Sociological Methods & Research, 44(2):186–221, 2015.
- [6] Marco Pangallo, Alberto Aleta, R. Maria del Rio-Chanona, et al. The unequal effects of the health–economy trade-off during the COVID-19 pandemic. Nature Human Behaviour, 8(2):264–275, 2024.
- [7] A. Sorgente, R. Caliciuri, M. Robba, M. Lanz, and B. D. Zumbo. A systematic review of latent class analysis in psychology: Examining the gap between guidelines and research practice. Behavior Research Methods, 57(11):301, 2025.
- [8] Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, et al. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
- [9] James Bisbee, Joshua D. Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M. Larson. Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4):401–416, 2024.
- [10] Yong Cao, Haijiang Liu, Arnav Arora, et al. Specializing large language models to simulate survey response distributions for global populations. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3141–3154. Association for Computational Linguistics, 2025.
- [11] Akaash Kolluri, Shengguang Wu, Joon Sung Park, and Michael S. Bernstein. Finetuning LLMs for human behavior prediction in social science experiments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30084–30099, 2025.
- [12] Christopher A. Bail. Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024.
- [13] Pujen Shrestha, Dario Krpan, Fatima Koaik, et al. Beyond WEIRD: Can synthetic survey participants substitute for humans in global policy research? Behavioral Science & Policy, 10(2):26–45, 2024.
- [14] Xiaoliang Luo, Akilles Rechardt, Guangzhi Sun, et al. Large language models surpass human experts in predicting neuroscience results. Nature Human Behaviour, 9(2):305–315, 2025.
- [15] Lisa P. Argyle, Ethan C. Busby, Joshua R. Gubler, et al. Testing theories of political persuasion using AI. Proceedings of the National Academy of Sciences, 122(18):e2412815122, 2025.
- [16] Chen Gao, Xiaochong Lan, Nian Li, et al. Large language models empowered agent-based modeling and simulation: a survey and perspectives. Humanities and Social Sciences Communications, 11(1):1259, 2024.
- [17] Carolin Kaiser, Jakob Kaiser, Vladimir Manewitsch, Lea Rau, and Rene Schallner. Simulating human opinions with large language models: Opportunities and challenges for personalized survey data modeling, 2025.
- [18] Pat Pataranutaporn, Nattavudh Powdthavee, Chayapatr Archiwaranguprok, and Pattie Maes. Simulating human well-being with large language models: Systematic validation and misestimation across 64,000 individuals from 64 countries. Proceedings of the National Academy of Sciences, 122(48):e2519394122, 2025.
- [19] Marcel Binz, Elif Akata, Matthias Bethge, et al. A foundation model to predict and capture human cognition. Nature, 644(8078):1002–1009, 2025.
- [20] Gati V. Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. PMLR, 2023.
- [21] Ziyan Cui, Ning Li, and Huaikang Zhou. A large-scale replication of scenario-based experiments in psychology and management using large language models. Nature Computational Science, 5(8):627–634, 2025.
- [22] Marcelo Sartori Locatelli, Pedro Dutenhefner, Arthur Buzelin, et al. AI and climate change discourse: What opinions do large language models present? In Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025), pages 113–125. Association for Computational Linguistics, 2025.
- [23] Jinghua Piao, Yuwei Yan, Jun Zhang, et al. AgentSociety: Large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691, 2025.
- [24] Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. Predicting results of social science experiments using large language models. Preprint, 2024.
- [25] Yuan Gao, Dokyun Lee, Gordon Burtch, and Sina Fazelpour. Take caution in using LLMs as human surrogates. Proceedings of the National Academy of Sciences, 122(24):e2501660122, 2025.
- [26] Olivier Toubia, George Z. Gui, Tianyi Peng, et al. Database report: Twin-2K-500: A data set for building digital twins of over 2,000 people based on their answers to over 500 questions. Marketing Science, 44(6):1446–1455, 2025.
- [27] Jessica Hullman, David Broska, Huaman Sun, and Aaron Shaw. This human study did not involve human subjects: Validating LLM simulations as behavioral evidence. arXiv preprint arXiv:2602.15785, 2026.
- [28] Kimberly C. Doell, Boryana Todorova, Madalina Vlasceanu, et al. The international climate psychology collaboration: Climate change-related data collected from 63 countries. Scientific Data, 11(1):1066, 2024.
- [29] Madalina Vlasceanu, Kimberly C. Doell, Joseph B. Bak-Coleman, et al. Addressing climate change with behavioral science: A global intervention tournament in 63 countries. Science Advances, 10(6):eadj5778, 2024.
- [30] Tobia Spampatti, Ulf J. J. Hahnel, Evelina Trutnevyte, and Tobias Brosch. Psychological inoculation strategies to fight climate disinformation across 12 countries. Nature Human Behaviour, 8(2):380–398, 2024.
- [31] Bojana Većkalov, Sandra J. Geiger, František Bartoš, et al. A 27-country test of communicating the scientific consensus on climate change. Nature Human Behaviour, 8(10):1892–1905, 2024.
- [32] George Gui and Olivier Toubia. The challenge of using LLMs to simulate human behavior: A causal inference perspective. arXiv preprint arXiv:2312.15524, 2023.
- [33] V. A. Shaffer, E. S. Focella, A. Hathaway, L. D. Scherer, and B. J. Zikmund-Fisher. On the usefulness of narratives: An interdisciplinary review and theoretical model. Annals of Behavioral Medicine, 52(5):429–442, 2018.
- [34]
- [35] Tiancheng Hu, Yara Kyrychenko, Steve Rathje, et al. Generative language models exhibit social identity biases. Nature Computational Science, 5(1):65–75, 2025.
- [36] Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, et al. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179, 2024.
- [37] Suhaib Abdurahman, Alireza Salkhordeh Ziabari, Alexander K. Moore, Daniel M. Bartels, and Morteza Dehghani. A primer for evaluating large language models in social-science research. Advances in Methods and Practices in Psychological Science, 8(2):25152459251325174, 2025.
- [38] Florian Lange and Siegfried Dewitte. The work for environmental protection task: A consequential web-based procedure for studying pro-environmental behavior. Behavior Research Methods, 54(1):133–145, 2022.
- [39] Paul C. Stern, Thomas Dietz, Troy Abel, Gregory A. Guagnano, and Linda Kalof. A value-belief-norm theory of support for social movements: The case of environmentalism. Human Ecology Review, 6(2):81–97, 1999.