pith. machine review for the scientific record.

arxiv: 2604.19787 · v1 · submitted 2026-03-31 · 💻 cs.CL · cs.AI · cs.CY


LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans


Pith reviewed 2026-05-13 22:56 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords LLM agents · social media prediction · persona simulation · text classifiers · zero-shot prompting · Matthews correlation coefficient · human reaction forecasting · AI behavioral fidelity

The pith

LLM agents can predict specific individuals' social media reactions above chance but are outperformed by conventional text classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks whether LLM agents given detailed personas built from real survey data can forecast how those same people would react to social media content: like, dislike, comment, share, or no reaction. Across more than 120,000 agent simulations drawn from 1,511 participants, the agents reached 70.7 percent overall accuracy and a Matthews Correlation Coefficient (MCC) of 0.29 on forced like-or-dislike choices, a signal clearly stronger than random guessing. Standard supervised TF-IDF text classifiers achieved a higher MCC of 0.36, suggesting that the agents' predictive gains stem from semantic access to the post text rather than from distinctively agentic reasoning. The results matter because zero-shot agents require no task-specific training and could therefore be scaled quickly to model opinion dynamics or to test platform interventions.

Core claim

Zero-shot LLM agents prompted with survey-derived personas from 1,511 individuals achieve genuine predictive validity for social media reactions, reaching an MCC of 0.29 in binary like/dislike settings across 27 models, yet they are surpassed by supervised TF-IDF classifiers at an MCC of 0.36. Overall accuracy reaches 70.7 percent, with a 13-point spread tied to model choice. Because the agents require no task-specific training, they can be deployed broadly for simulating reactions, which also raises concerns about the easy creation of behaviorally distinct AI swarms.
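The MCC figures anchor the whole comparison, so it is worth seeing how the chance-corrected metric is computed. A minimal pure-Python sketch, with illustrative labels rather than the paper's data:

```python
# Matthews Correlation Coefficient for binary like/dislike labels.
# MCC is 0 at chance level, 1 for perfect prediction, -1 for inverted
# prediction, which is why the paper prefers it to raw accuracy.
from math import sqrt

def mcc(y_true, y_pred):
    """MCC over binary labels (1 = like, 0 = dislike)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical agent predictions
print(round(mcc(y_true, y_pred), 2))  # prints 0.5
```

On this scale, the paper's agent MCC of 0.29 is well above chance but clearly short of the TF-IDF baseline's 0.36.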

What carries the argument

Persona-prompted zero-shot LLM agents derived from survey responses of 1,511 participants, evaluated on 120,000+ reaction predictions against ground-truth human labels.
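The mechanism can be pictured concretely. A hypothetical sketch of how a survey-derived persona might be rendered into a zero-shot prompt; the field names and wording below are assumptions for illustration, not the paper's actual template:

```python
# Hypothetical persona-prompt construction. The survey fields (age,
# gender, country, attitudes, media habits) and the instruction text
# are assumed here, not taken from the paper.
PERSONA_TEMPLATE = (
    "You are a {age}-year-old {gender} from {country}. "
    "Your attitudes: {attitudes}. Your media habits: {media_habits}.\n"
    "Given the following social media post, answer with exactly one of: "
    "like, dislike, comment, share, no reaction.\n\n"
    "Post: {post}"
)

def build_prompt(persona: dict, post: str) -> str:
    """Fill the template with one participant's survey-derived fields."""
    return PERSONA_TEMPLATE.format(post=post, **persona)

prompt = build_prompt(
    {"age": 34, "gender": "woman", "country": "Serbia",
     "attitudes": "moderate, pro-EU", "media_habits": "daily news scroller"},
    "New study finds social media use linked to polarization.",
)
```

Each of the 120,000+ simulations pairs one such persona with one post and compares the model's single-word answer to the participant's recorded reaction.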

If this is right

  • Zero-shot agents allow large-scale simulation of polarization without any task-specific training data.
  • The demonstrated predictive signal supports using agents to forecast public discourse outcomes for policy design.
  • Easy deployment of such agents creates a concrete risk of coordinated manipulation on social platforms.
  • Model choice matters, as performance varied by 13 percentage points across the 27 LLMs tested.
  • Single-country data collection limits claims about universal applicability of the personas.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If agents can stand in for individuals this closely, platforms may require new methods to detect synthetic engagement at scale.
  • Testing the same setup on multilingual posts would show whether cultural differences are captured by the current personas or require separate handling.
  • Fine-tuning agents on reaction data could narrow the gap with classifiers, but would remove the zero-shot advantage that makes deployment cheap.
  • For pure prediction accuracy on known content types, traditional supervised classifiers remain the practical choice over agents.

Load-bearing premise

Survey-derived personas from one country encode the stable personal factors that shape real social media reactions, and zero-shot prompting produces simulation rather than surface text matching.

What would settle it

Running the same agents on reaction labels collected from a fresh multi-country participant pool and checking whether their MCC remains above 0.29 while TF-IDF classifiers retain their 0.36 advantage.

Original abstract

Social media platforms mediate how billions form opinions and engage with public discourse. As autonomous AI agents increasingly participate in these spaces, understanding their behavioral fidelity becomes critical for platform governance and democratic resilience. Previous work demonstrates that LLM-powered agents can replicate aggregate survey responses, yet few studies test whether agents can predict specific individuals' reactions to specific content. This study benchmarks LLM-based agents' accuracy in predicting human social media reactions (like, dislike, comment, share, no reaction) across 120,000+ unique agent-persona combinations derived from 1,511 Serbian participants and 27 large language models. In Study 1, agents achieved 70.7% overall accuracy, with LLM choice producing a 13 percentage-point performance spread. Study 2 employed binary forced-choice (like/dislike) evaluation with chance-corrected metrics. Agents achieved Matthews Correlation Coefficient (MCC) of 0.29, indicating genuine predictive signal beyond chance. However, conventional text-based supervised classifiers using TF-IDF representations outperformed LLM agents (MCC of 0.36), suggesting predictive gains reflect semantic access rather than uniquely agentic reasoning. The genuine predictive validity of zero-shot persona-prompted agents warns against potential manipulation through easily deploying swarms of behaviorally distinct AI agents on social media, while simultaneously offering opportunities to use such agents in simulations for predicting polarization dynamics and informing AI policy. The advantage of using zero-shot agents is that they require no task-specific training, making their large-scale deployment easy across diverse contexts. Limitations include single-country sampling. Future research should explore multilingual testing and fine-tuning approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript benchmarks zero-shot LLM agents prompted with survey-derived personas from 1,511 Serbian participants to predict individual reactions (like, dislike, comment, share, no reaction) to social media posts. Across 27 models and 120,000+ agent-persona combinations, Study 1 reports 70.7% multi-class accuracy with a 13-point model spread; Study 2 reports binary like/dislike MCC of 0.29 for agents versus 0.36 for TF-IDF classifiers, concluding that agents capture genuine signal via semantic access rather than uniquely agentic reasoning, with implications for AI deployment and simulation.

Significance. If the comparative results hold, the work supplies large-scale empirical evidence with direct human ground truth that zero-shot persona agents achieve above-chance predictive validity (MCC 0.29) yet are outperformed by simple supervised classifiers, supporting claims about easy deployment risks and simulation utility while highlighting that gains likely stem from pre-trained semantic priors. The two-study design and concrete metrics (70.7% accuracy, MCC values) strengthen the benchmark's contribution to LLM agent evaluation.

major comments (3)
  1. [Study 2] Study 2 methods and results: the claim that the MCC of 0.29 reflects persona-driven capture of stable individual factors (rather than the LLM's generic semantic understanding of post text) is weakened by the absence of an ablation that replaces detailed personas with a generic prompt (e.g., 'You are a social media user') while holding the 27 models, posts, and evaluation protocol fixed. Without this control, the observed signal could be produced by the same text information used by the TF-IDF baseline that reaches MCC 0.36.
  2. [Results] Results (binary evaluation): the reported agent MCC of 0.29 and the 0.07 gap to the classifier lack per-model variance, confidence intervals, or error bars across the 27 LLMs, making it difficult to determine whether the comparative claim is robust or driven by a subset of models.
  3. [Methods] Methods (persona construction): the survey-derived personas are presented as capturing stable individual factors that determine reactions, yet no validation is provided that these factors remain consistent across different posts or time, which is load-bearing for interpreting the agents' predictive signal as persona-specific rather than content-driven.
minor comments (2)
  1. [Abstract] Abstract: the phrase '120K+ Personas of 1511 Humans' should be clarified with the exact total number of unique agent-persona combinations to avoid ambiguity.
  2. [Figures] Figure clarity: the performance spread across models in Study 1 would be easier to interpret with error bars or a table of per-model accuracies rather than a single aggregate figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating revisions where we agree changes are needed to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Study 2] Study 2 methods and results: the claim that the MCC of 0.29 reflects persona-driven capture of stable individual factors (rather than the LLM's generic semantic understanding of post text) is weakened by the absence of an ablation that replaces detailed personas with a generic prompt (e.g., 'You are a social media user') while holding the 27 models, posts, and evaluation protocol fixed. Without this control, the observed signal could be produced by the same text information used by the TF-IDF baseline that reaches MCC 0.36.

    Authors: We agree that an ablation with a generic prompt would better isolate the contribution of detailed personas. We will add this control to the revised manuscript by re-running the binary like/dislike task across all 27 models and the same posts using the prompt 'You are a social media user', then report the resulting MCC for direct comparison to the persona-based MCC of 0.29. revision: yes

  2. Referee: [Results] Results (binary evaluation): the reported agent MCC of 0.29 and the 0.07 gap to the classifier lack per-model variance, confidence intervals, or error bars across the 27 LLMs, making it difficult to determine whether the comparative claim is robust or driven by a subset of models.

    Authors: We will revise the Results section to include per-model MCC values for all 27 LLMs, accompanied by 95% confidence intervals obtained via bootstrapping. This will allow readers to assess the distribution and confirm that the reported average of 0.29 is robust rather than driven by a small number of models. revision: yes

  3. Referee: [Methods] Methods (persona construction): the survey-derived personas are presented as capturing stable individual factors that determine reactions, yet no validation is provided that these factors remain consistent across different posts or time, which is load-bearing for interpreting the agents' predictive signal as persona-specific rather than content-driven.

    Authors: The personas integrate multiple stable constructs (demographics, attitudes, media habits, and personality traits) drawn from established survey instruments. We did not conduct explicit test-retest or cross-post consistency checks. We will expand the Methods and Limitations sections to state this assumption explicitly, reference supporting psychometric literature on trait stability, and identify longitudinal validation as a direction for future work. revision: partial
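The percentile-bootstrap procedure promised in response 2 is simple enough to sketch. The metric here is plain accuracy on synthetic labels, purely to show the mechanics; the authors would apply the same resampling to per-model MCC:

```python
# Percentile bootstrap for a 95% CI on a prediction metric: resample
# (truth, prediction) pairs with replacement and take the empirical
# quantiles of the recomputed metric. Stdlib only; data is synthetic.
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Return (lower, upper) percentile-bootstrap bounds for metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

rng = random.Random(1)
y_true = [rng.randint(0, 1) for _ in range(200)]
y_pred = [t if rng.random() < 0.8 else 1 - t for t in y_true]  # ~80% agreement
lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
```

Reporting such intervals per model would show directly whether the 0.07 MCC gap to the TF-IDF baseline exceeds sampling noise.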

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with direct human ground truth

Full rationale

This is a pure empirical benchmarking study. The paper collects survey responses from 1,511 humans, derives personas, prompts LLMs in zero-shot fashion to predict reactions to specific posts, and directly compares the resulting predictions (accuracy 70.7%, MCC 0.29) against the held-out human ground truth. A separate TF-IDF supervised classifier is trained and evaluated on the same data for comparison. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps. All reported metrics are computed against external human labels rather than being forced by construction from the inputs.
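The TF-IDF representation behind that supervised baseline is simple enough to sketch directly. The toy corpus and the raw-tf × smoothed-idf weighting below are illustrative, not the paper's exact configuration:

```python
# Pure-Python TF-IDF: term frequency within a document, scaled by a
# smoothed inverse document frequency across the corpus. Rare terms
# get higher weight; ubiquitous terms get weight close to 1 * tf.
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document."""
    n = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed idf
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

vecs = tfidf([
    "great post love this",
    "terrible take dislike this",
    "love love this post",
])
```

A standard classifier fit on such vectors, as in the paper's baseline, needs labeled reactions to train on, which is exactly the task-specific data the zero-shot agents do without.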

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that survey responses yield faithful personas and that LLM outputs reflect simulated reasoning rather than text pattern matching; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Survey-derived personas accurately capture the stable individual factors determining social media reactions.
    Invoked to interpret agent accuracy as evidence of behavioral fidelity.

pith-pipeline@v0.9.0 · 5633 in / 1263 out tokens · 50068 ms · 2026-05-13T22:56:51.167893+00:00 · methodology


Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 1 internal anchor
