
arxiv: 2605.18781 · v1 · pith:BTN66VH7new · submitted 2026-05-05 · 💻 cs.SI · cs.AI · cs.CY

Can LLMs Emulate Human Belief Dynamics?

Pith reviewed 2026-05-20 23:37 UTC · model grok-4.3

classification 💻 cs.SI · cs.AI · cs.CY
keywords LLMs · belief dynamics · social networks · conformity · opinion updating · simulation · replication · human behavior

The pith

Large language models fail to match human starting beliefs and conform more strongly to network neighbors than people do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can reproduce the way humans form and revise beliefs inside social networks by repeating a prior human experiment with twelve different models. The models produce initial belief distributions unlike those seen in people and then adjust their answers more readily to match the average opinion of their neighbors. A sympathetic reader would care because many researchers now substitute these models for human subjects when studying opinion spread, polarization, or consensus; systematic differences in basic updating rules would make the resulting simulations unreliable for real populations. The work therefore both diagnoses a limitation in current LLMs and cautions against treating them as interchangeable stand-ins for human agents.

Core claim

When twelve LLMs drawn from multiple families and sizes were embedded in the identical network structure used in an earlier human study and allowed to observe and update beliefs in the same repeated manner, the models generated initial belief distributions that diverged from the human data and displayed greater conformity by shifting their responses closer to the average of their network neighbors.

What carries the argument

Each agent repeatedly observes its neighbors' reported beliefs and then updates its own. Running the same observe-then-update loop with LLMs and with human participants tests whether the models' response patterns statistically match the conformity and homophily seen in people.
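
As a minimal sketch of this observe-then-update loop, the Python below replaces the respondent (LLM or human) with a simple averaging rule in the spirit of DeGroot's consensus model. The averaging rule, the `conformity` weight, and the four-node ring network are assumptions made for illustration only; in the paper each update is elicited from a model or a person, not computed.

```python
from statistics import mean

def update_round(beliefs, neighbors, conformity):
    """One synchronous round: each agent observes its neighbors' reported
    beliefs, then moves a fraction `conformity` of the way toward their mean.
    `conformity` is a hypothetical stand-in for the respondent's update rule."""
    new_beliefs = {}
    for agent, own in beliefs.items():
        observed = [beliefs[n] for n in neighbors[agent]]
        if not observed:  # isolated agents keep their belief
            new_beliefs[agent] = own
            continue
        new_beliefs[agent] = own + conformity * (mean(observed) - own)
    return new_beliefs

def simulate(beliefs, neighbors, conformity, rounds):
    """Return the belief state before the first round and after every round."""
    history = [dict(beliefs)]
    for _ in range(rounds):
        beliefs = update_round(beliefs, neighbors, conformity)
        history.append(dict(beliefs))
    return history

def spread(beliefs):
    """Range of beliefs, used here as a crude measure of disagreement."""
    return max(beliefs.values()) - min(beliefs.values())

# Toy ring network and made-up starting beliefs (not data from the paper).
neighbors = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
start = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 5.0}

low = simulate(start, neighbors, conformity=0.2, rounds=3)
high = simulate(start, neighbors, conformity=0.6, rounds=3)
print(spread(low[-1]), spread(high[-1]))
```

Under this toy rule, the agents with the larger `conformity` weight end the three rounds with a narrower spread of beliefs, which is the qualitative pattern the review attributes to LLMs relative to humans.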

If this is right

  • Simulations built on LLMs will overstate the speed and extent of opinion convergence within groups.
  • The variety of starting opinions present at the beginning of a discussion will be lower than in actual human data.
  • Forecasts of how information spreads or polarizes in networks will not generalize from LLM agents to human populations.
  • Any study that uses LLMs as human proxies must first benchmark their updating behavior against human records.
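
One way such a benchmark could be set up is sketched below: for each update, measure the fraction of the gap to the neighbors' mean that the respondent closed, then compare the average for LLM agents against the average for human records. The `conformity_index` metric and the sample triples are assumptions for illustration; they are not the paper's measures or data.

```python
from statistics import mean

def conformity_index(before, after, neighbor_mean):
    """Fraction of the gap to the neighbors' mean closed by one update.
    0 means no movement, 1 means full adoption of the neighbor mean.
    Returns None when there was no gap to close."""
    gap = neighbor_mean - before
    if gap == 0:
        return None
    return (after - before) / gap

def mean_conformity(records):
    """Average the index over (before, after, neighbor_mean) triples,
    skipping updates where the index is undefined."""
    scores = [conformity_index(*record) for record in records]
    return mean(score for score in scores if score is not None)

# Made-up (before, after, neighbor_mean) triples, NOT data from the paper.
human_records = [(2.0, 2.5, 4.0), (5.0, 4.5, 3.0), (3.0, 3.0, 3.0)]
llm_records = [(2.0, 3.5, 4.0), (5.0, 3.5, 3.0), (3.0, 3.0, 3.0)]

conformity_gap = mean_conformity(llm_records) - mean_conformity(human_records)
print(conformity_gap)
```

A positive `conformity_gap` would indicate that the LLM agents move further toward their neighbors than the human benchmark does on the same inputs.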

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The extra conformity may arise because LLMs are trained on large text collections that already average many voices rather than preserving individual variation.
  • A follow-up test could check whether fine-tuning or prompt adjustments that force the model to match the original human belief spread reduce the conformity difference.
  • Similar replication checks could be performed on other computational agent models to determine whether the mismatch is specific to LLMs or common across many artificial representations of social agents.

Load-bearing premise

The specific prompting and interaction rules given to the LLMs create a situation equivalent to the one human subjects encountered in the original study.

What would settle it

Run the identical network experiment again while explicitly prompting the models to sample initial beliefs from the exact distribution reported in the human data; if the conformity gap remains, the claim that LLMs are unsuitable proxies is strengthened, but if the gap disappears the original difference was likely an artifact of the prompting method.
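
The initial-belief control in this test could be approximated by resampling from the observed human responses, so that every simulated population starts from the human distribution. The sketch below assumes a hypothetical 1 to 5 response scale and placeholder values; the function name and its parameters are illustrative and do not come from the paper.

```python
import random
from collections import Counter

def sample_initial_beliefs(human_beliefs, n_agents, seed):
    """Draw starting beliefs for n_agents by resampling, with replacement,
    from the observed human responses (an empirical-distribution draw)."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    return rng.choices(human_beliefs, k=n_agents)

# Placeholder responses on a hypothetical 1-5 scale, NOT the study's data.
human_beliefs = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5]

seeded = sample_initial_beliefs(human_beliefs, n_agents=1000, seed=0)
shares = {value: count / len(seeded) for value, count in Counter(seeded).items()}
print(sorted(shares.items()))
```

Each sampled value would then be supplied to the model as its starting belief, so any remaining conformity gap cannot be blamed on mismatched starting points.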

Figures

Figures reproduced from arXiv: 2605.18781 by Adiba Mahbub Proma, Ehsan Hoque, Gourab Ghoshal, Hangfeng He, James N. Druckman, Neeley Pate.

Figure 1: Experimental setup for the LLM simulations. In the actual experiments, each …
Figure 2: The plots in the right show that non-thinking models tend to overestimate, and …
Figure 3: Stage 2 belief change analysis. (a) Top: Mean and standard deviation of belief …
Figure 4: (a) Top-left: KDE distribution of Follow Signal (b) Bottom-left: KDE distribution …
Figure 5: Spearman correlation heatmap showing belief change across various models and …
Original abstract

Can LLMs simulate how humans form and change beliefs in social networks? We put this to the test by replicating an established study on belief dynamics, evaluating 12 LLMs across multiple model families and parameter sizes. The answer is a clear no, and in systematic ways. LLMs fail to capture initial human belief distributions and tend to be overall more conformist than humans, shifting their responses to align with those around them. They also take a nuanced approach to emulating human homophilic tendencies within networks. Our findings carry a double payoff: they highlight fundamental properties of LLM behavior, and they raise a sharp warning against deploying LLMs as human proxies in social simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper replicates an established human belief dynamics study in social networks by evaluating 12 LLMs across model families and sizes. It claims LLMs fail to match initial human belief distributions, exhibit greater overall conformism by shifting responses to align with neighbors, and display nuanced emulation of homophilic tendencies, concluding that LLMs are unsuitable as human proxies in social simulations.

Significance. If the empirical results hold after addressing methodological gaps, the work provides concrete evidence of systematic divergences between LLM and human social cognition, with direct implications for the validity of LLM-based social simulations. The replication design and multi-model evaluation are strengths that could make the findings a useful benchmark if controls for prompting artifacts are strengthened.

major comments (2)
  1. [Methods] Methods section: the prompting protocol for network influence (describing neighbors' beliefs and eliciting updates) lacks ablations on phrasing variants, history length, or single-turn vs. multi-turn formats. This is load-bearing for the central conformism claim: without such tests, the skeptic's concern that explicitly surfacing peer beliefs induces surface-level alignment rather than genuine updating cannot be ruled out.
  2. [Results] Results section: reported mismatches in initial belief distributions and conformism differences are presented without sample sizes, exact statistical methods, or controls for model stochasticity (e.g., temperature, multiple runs), undermining verification of robustness as highlighted in the soundness assessment.
minor comments (2)
  1. [Abstract] Abstract: 'multiple model families and parameter sizes' is stated without listing the specific models or sizes used, reducing clarity for readers.
  2. [Introduction] The paper should explicitly reference the original human study being replicated (citation and key design parameters) in the introduction to allow direct comparison.
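
The ablations requested in major comment 1 amount to a full factorial grid over prompt conditions. A minimal sketch of enumerating that grid follows; the phrasing labels and model names are hypothetical placeholders, and the history lengths of 1 and 3 follow the values proposed in the simulated rebuttal rather than anything reported in the paper.

```python
from itertools import product

# Hypothetical condition labels; the paper's actual prompt variants are not given here.
PHRASINGS = ["original", "paraphrase_a", "paraphrase_b"]
HISTORY_LENGTHS = [1, 3]
FORMATS = ["single_turn", "multi_turn"]

def ablation_grid(models, phrasings=PHRASINGS,
                  history_lengths=HISTORY_LENGTHS, formats=FORMATS):
    """Enumerate every model x phrasing x history length x format combination."""
    return [
        {"model": m, "phrasing": p, "history_length": h, "format": f}
        for m, p, h, f in product(models, phrasings, history_lengths, formats)
    ]

grid = ablation_grid(["model_x", "model_y"])  # placeholder model names
print(len(grid))
```

Running each condition several times and reporting the conformism metric per cell would show whether the effect survives changes in how peer beliefs are presented.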

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the methodological transparency and robustness of our replication study. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Methods] Methods section: the prompting protocol for network influence (describing neighbors' beliefs and eliciting updates) lacks ablations on phrasing variants, history length, or single-turn vs. multi-turn formats. This is load-bearing for the central conformism claim: without such tests, the skeptic's concern that explicitly surfacing peer beliefs induces surface-level alignment rather than genuine updating cannot be ruled out.

    Authors: We agree that the absence of systematic ablations on the prompting protocol represents a methodological gap that limits the strength of the conformism findings. The current protocol was chosen to closely replicate the original human study, but we recognize that variants in phrasing, history length, and interaction format could influence whether observed shifts reflect genuine updating or surface alignment. In the revised manuscript, we will add a new subsection in Methods describing ablation experiments on three key dimensions (phrasing variants, history lengths of 1 vs. 3 turns, and single-turn vs. multi-turn elicitation), with results reported in an expanded Results section showing their effects on conformism metrics. These additions will directly address the concern and improve the defensibility of the central claim. revision: yes

  2. Referee: [Results] Results section: reported mismatches in initial belief distributions and conformism differences are presented without sample sizes, exact statistical methods, or controls for model stochasticity (e.g., temperature, multiple runs), undermining verification of robustness as highlighted in the soundness assessment.

    Authors: We accept this criticism as valid; clearer reporting of sample sizes, statistical procedures, and stochasticity controls is necessary for reproducibility and robustness assessment. Although the manuscript describes the overall experimental scale (12 models, multiple network conditions), we will revise the Results section to include a dedicated reporting subsection that specifies exact sample sizes per condition, the precise statistical tests (e.g., Kolmogorov-Smirnov for distribution comparisons and t-tests for mean shifts), and controls for model stochasticity via repeated runs at fixed temperatures with standard errors and seed variation. These details will be added without altering the core findings. revision: yes
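
The two tests named in this response can be illustrated with a short standard-library sketch: a two-sample Kolmogorov-Smirnov statistic for comparing belief distributions and Welch's t statistic for comparing mean shifts. This is an assumption-laden illustration, not the authors' analysis code; the sample values are placeholders, and p-values are omitted (a statistics package would normally supply them).

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, variance

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

def welch_t(sample_a, sample_b):
    """Welch's t statistic for a difference in means (unequal variances)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Placeholder responses, NOT data from the paper.
human = [1, 2, 2, 3, 5]
llm = [3, 3, 3, 4, 4]

d_stat = ks_statistic(human, llm)
t_stat = welch_t(llm, human)
print(d_stat, t_stat)
```

Repeating the computation over several seeds or runs per model, as the response proposes, would show how much of any difference is attributable to sampling noise.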

Circularity Check

0 steps flagged

No circularity in empirical replication study

Full rationale

The paper performs a direct empirical replication by running LLMs through a belief-updating protocol and comparing outputs to human data from prior external studies. No equations, fitted parameters, predictions, or derivations are present that could reduce to inputs by construction. Claims rest on observable differences in simulated responses versus human benchmarks, which are falsifiable without self-reference. Self-citations, if any, support the replicated protocol but do not bear the load of the central empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the validity of the chosen human benchmark study and the assumption that LLM prompting faithfully probes emulation capacity rather than testing prompt sensitivity.

axioms (1)
  • domain assumption: The replicated human study provides a valid and representative benchmark for belief dynamics in social networks.
    The paper uses this prior study as the ground truth against which LLM performance is measured.

pith-pipeline@v0.9.0 · 5656 in / 1005 out tokens · 36252 ms · 2026-05-20T23:37:46.640019+00:00 · methodology


Appendix excerpts

A Prompts used for simulation

The prompts used for simulation are provided below for each stage. For the analysis, memory_summary is empty and is passed as an empty string.

Stage 1
You are the person agent_id={agent_id}. You have the fo...

Note that in case an LLM fails to generate a certain instance, we drop that instance from the human data as well when comparing with that specific LLM.

Model Type    Model         LLM Mean  LLM Std  Actual Mean  Actual Std
Non-Thinking  gemma3_4b     3.1037    0.9319   2.2518       1.2490
Non-Thinking  gemma3_27b    2.9718    1.0121   2.2518       1.2490
Non-Thinking  llama3_8b     2.3887    1.0145   2.2518       1.2490
Non-Thinking  llama3_70b    2.2145    1.2247   2.2518       1.2490
Non-Thinking  llama3.2_3b   2.0578    1.3569   2.3470       1.2159
Non-Thinking  llama3.3_70b  2.5700    1.0092   2...

D Knowledge Cutoff Dates for each Model

Model       Training Cutoff Date  Source
gemma3_4b   March 2024            https://gradientflow.com/gemma-3-what-you-need-to-know/
gemma3_27b  March 2024            https://gradientflow.com/gemma-3-what-you-need-to-know/
llama3_...