arXiv: 2605.17079 · v1 · pith:2C3V4EBRnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI · cs.CY

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Pith reviewed 2026-05-20 15:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords LLM evaluation · consumer simulation · social media · public opinion · benchmark · reaction reconstruction · Chinese discourse

The pith

Even the best LLMs reconstruct fewer than half of the specific reactions real consumers voice on social media.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark that tests whether LLMs can reconstruct the concrete reaction patterns real people express in public discussions. It starts from 1,553 real Chinese social-media topics and turns each possible reaction into dozens of small, rule-checked yes-no questions rather than one overall judgment. Across thirteen frontier models the highest coverage reaches only 47.8 percent of these real criteria, showing that strong results on technical tests do not guarantee accurate prediction of what consumers will actually notice or care about. The work treats consumer simulation as a forecasting task against recorded public discourse instead of abstract preference scoring. If the gap holds, many current uses of LLMs for marketing pre-tests or opinion polling rest on an unproven assumption of social fidelity.

Core claim

ConsumerSimBench demonstrates that frontier LLMs achieve at most 47.8 percent coverage of the atomic, rule-audited reaction criteria drawn from real Chinese social-media discourse, establishing that technical benchmark strength does not translate into faithful reconstruction of crowd-level consumer responses.

What carries the argument

ConsumerSimBench, which converts open reaction reconstruction into a set of auditable binary decisions over 23,122 atomic criteria extracted from 1,553 real topics and spanning four reaction families.
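
As a rough illustration of how binary per-criterion decisions could roll up into one coverage score, here is a minimal sketch. It assumes the equal weighting of the four reaction families mentioned in the Figure 2 caption; the function name, the input layout, and the choice to pool criteria within a family before averaging are hypothetical, and the paper's exact aggregation (for example, per-topic averaging) may differ.

```python
from collections import defaultdict

# Family names as listed in the Figure 1 caption.
FAMILIES = ("flashpoints", "emotion", "praise", "critique")


def family_weighted_coverage(decisions):
    """Average per-family coverage with equal weight per family.

    decisions: iterable of (family, covered) pairs, one per atomic criterion,
    where covered is the judge's yes/no decision. Hypothetical layout.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for family, covered in decisions:
        totals[family] += 1
        hits[family] += int(bool(covered))
    # Coverage within each family that has at least one criterion.
    per_family = {f: hits[f] / totals[f] for f in FAMILIES if totals[f] > 0}
    if not per_family:
        raise ValueError("no criteria supplied")
    overall = sum(per_family.values()) / len(per_family)
    return overall, per_family


# Toy decisions, invented for illustration only.
example = [
    ("flashpoints", True), ("flashpoints", False),
    ("emotion", True),
    ("praise", False), ("praise", False),
    ("critique", True), ("critique", True), ("critique", False), ("critique", True),
]
overall, per_family = family_weighted_coverage(example)
print(overall, per_family)
```

On this toy input the family-weighted score is 0.5625, while pooling all nine criteria would give 5/9, so the weighting choice can shift a headline figure when families differ in size.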

If this is right

  • LLMs remain unreliable for predicting the specific elements consumers will highlight in public discussions.
  • Performance on standard technical benchmarks does not forecast success at socially grounded consumer simulation.
  • Structured reasoning prompts can lower coverage of real reaction criteria.
  • Multi-agent generate-and-reflect pipelines produce modest gains on subsets of the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition method could expose comparable shortfalls when applied to consumer discourse in other languages or platforms.
  • Training or fine-tuning on large collections of annotated real-world reactions may be necessary to close the observed gap.
  • Any production system that uses LLMs for audience testing should add direct checks against recorded public discourse patterns.

Load-bearing premise

The 23,122 atomic criteria accurately and completely represent the reaction patterns that real consumers surface in public discourse on these topics.

What would settle it

A fresh set of human annotations on a new collection of topics that shows model coverage rates rising well above 50 percent or staying consistently below it.

Figures

Figures reproduced from arXiv: 2605.17079 by Jiajun Li, Jianghao Lin, Tianyu Wang.

Figure 1: Overview of ConsumerSimBench. Left: The 1,553 trending topics span four super-categories and twenty consumer-facing sub-fields. Right: One task instance. Given a real trending topic, a generator must produce a free-form bundle of consumer comments that collectively cover an audited set of atomic criteria across four reaction families (flashpoints, emotion, praise, critique).

Figure 2: ConsumerSimBench construction and evaluation pipeline. Public trend signals are curated into topic–event records and abstracted observed reactions; these reactions are converted into four families of atomic reaction criteria, hardened, and finally used for pointwise coverage judging of model-generated comments. The final score gives equal weight to the four reaction families.

Figure 3: Main results. Left: overall leaderboard on the final full benchmark. Right: section-wise …

Figure 4: Ranked sub-field gap plot across 20 consumer-facing domains. Rows are sorted by Gemini-3.1-Pro score; horizontal spans show the cross-model range.

Figure 5: Direct SCF prompting does not improve coverage. Arrows show the change from naive prompting (blue) to ConsumerSCF (red). We classify all 197,790 missed criteria across the 13 generators by parsing judge explanations into an 8-category taxonomy (Appendix B; the categorization is derived from keyword patterns in judge text plus a 50-sample manual audit). Using this taxonomy, we find that missing-content pat…

Figure 6: Second-platform YouTube pilot. Left: overall lollipop leaderboard on 100 English trending …

Original abstract

LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ConsumerSimBench, a benchmark for evaluating LLMs on reconstructing crowd-level consumer reactions from 1,553 real Chinese social-media topics. It decomposes reactions into 23,122 atomic, rule-audited criteria across four families and scores models via pointwise yes-no decisions rather than holistic judgments. This yields three-judge agreement of 92.1% (up from 65.8%) with 98.4% alignment to human-majority labels. Across 13 frontier models, Gemini-3.1-Pro covers only 47.8% of criteria while GPT-5.2 and Claude-4.6 perform worse despite strong technical benchmark results; structured reasoning prompts decrease coverage while a generate-reflect pipeline modestly improves one model. The work frames consumer simulation as forecasting over real public-discourse reactions.

Significance. If the criteria faithfully and exhaustively capture observable reactions, the results demonstrate a clear gap between frontier LLMs' technical capabilities and their ability to anticipate concrete consumer concerns in high-context settings. This has direct implications for marketing pre-testing, opinion simulation, and AI systems that must align with diverse human preferences. The shift to auditable pointwise decisions and grounding in external social-media data is a methodological advance over preference-based judges.

major comments (2)
  1. [§3 (Benchmark Construction)] The 23,122 criteria are presented as rule-audited and comprehensive, yet the manuscript provides no inter-auditor reliability statistics, no quantification of reactions discarded by the auditing rules, and no held-out coverage check against additional posts. This assumption is load-bearing for the headline 47.8% coverage result and the claim of a gap between technical and consumer-intuition performance.
  2. [Results §5.2 and Table 2] The 47.8% coverage for Gemini-3.1-Pro is reported without statistical tests or confidence intervals on the per-criterion coverage rates; given the large number of criteria, small differences in auditing rules could materially shift the ranking and the gap to other models.
minor comments (2)
  1. [§3] The four reaction families are introduced without a clear taxonomy or example criteria in the main text; moving one or two concrete examples to the main body would improve readability.
  2. [Figure 3] Figure 3 (prompting ablation) would benefit from reporting the exact subset size and variance across runs for the MiMo-V2.5-Pro improvement from 32.9% to 37.6%.
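
The first major comment asks for inter-auditor reliability statistics. A minimal sketch of one such statistic, Cohen's kappa for two auditors who each give a keep/discard label per candidate criterion, is shown here. The function name, the label encoding, and the toy labels are hypothetical; the manuscript does not specify which statistic or how many auditors would be used.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    labels_a, labels_b: equal-length lists of hashable labels
    (here 1 = keep criterion, 0 = discard). Hypothetical encoding.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and returning 1.0 is a convention chosen here.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
auditor_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
auditor_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(auditor_a, auditor_b)
print(round(kappa, 3))
```

Kappa discounts the agreement two auditors would reach by chance, which matters when most candidate criteria receive the same label and raw agreement is inflated.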

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 (Benchmark Construction)] The 23,122 criteria are presented as rule-audited and comprehensive, yet the manuscript provides no inter-auditor reliability statistics, no quantification of reactions discarded by the auditing rules, and no held-out coverage check against additional posts. This assumption is load-bearing for the headline 47.8% coverage result and the claim of a gap between technical and consumer-intuition performance.

    Authors: We agree that these validation details are important for establishing the robustness of the benchmark. In the revised manuscript we will add inter-auditor reliability statistics (e.g., Cohen’s kappa) for the rule-auditing process. We will also report the number and proportion of candidate reactions discarded by the auditing rules together with the primary reasons for exclusion. Finally, we will include a held-out coverage analysis on an additional set of posts drawn from the same source distribution and report the resulting coverage figures. These additions will appear in the updated §3. revision: yes

  2. Referee: [Results §5.2 and Table 2] The 47.8% coverage for Gemini-3.1-Pro is reported without statistical tests or confidence intervals on the per-criterion coverage rates; given the large number of criteria, small differences in auditing rules could materially shift the ranking and the gap to other models.

    Authors: We concur that statistical support is needed to substantiate the reported coverage figures and model rankings. In the revision we will augment §5.2 and Table 2 with bootstrap-derived confidence intervals for each model’s coverage rate and will include pairwise statistical comparisons (e.g., McNemar’s test on per-criterion decisions) to assess whether observed differences are significant. These analyses will be added to the results section and the table. revision: yes
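
The two analyses named in the second response can be sketched roughly as follows: a percentile bootstrap interval for one model's coverage rate, and an exact two-sided McNemar test on paired per-criterion decisions from two models. All names and the toy data are hypothetical. Resampling individual criteria, as done here, treats them as independent; criteria from the same topic are probably correlated, so resampling whole topics may be the safer choice in practice.

```python
import random
from math import comb


def bootstrap_ci(covered, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a coverage rate.

    covered: list of 0/1 judge decisions, one per criterion.
    """
    rng = random.Random(seed)
    n = len(covered)
    # Resample criteria with replacement and recompute the coverage rate.
    stats = sorted(
        sum(rng.choices(covered, k=n)) / n for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(model_a, model_b):
    """Exact two-sided McNemar p-value for paired 0/1 decisions."""
    # Discordant pairs: criteria covered by exactly one of the two models.
    b = sum(1 for x, y in zip(model_a, model_b) if x and not y)
    c = sum(1 for x, y in zip(model_a, model_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail up to the smaller discordant count, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy paired decisions on 20 criteria, invented for illustration only.
model_a = [1] * 8 + [0] * 1 + [1] * 5 + [0] * 6
model_b = [0] * 8 + [1] * 1 + [1] * 5 + [0] * 6
lo, hi = bootstrap_ci(model_a)
p_value = mcnemar_exact(model_a, model_b)
print(lo, hi, p_value)
```

McNemar's test uses only the criteria on which the two models disagree, which suits a benchmark where every model is judged against the same fixed criteria.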

Circularity Check

0 steps flagged

No circularity: benchmark anchored in external social-media data and human labels

The paper constructs ConsumerSimBench directly from 1,553 real Chinese social-media topics, decomposes them into 23,122 atomic rule-audited criteria across four reaction families, and measures model coverage against these externally sourced points. Validation reports 98.4% agreement between pointwise judge decisions and human-majority labels, with no fitted parameters, self-defined quantities, or self-citation chains invoked to justify the criteria or the 47.8% coverage result. The central claim is an empirical comparison against independent real-world discourse rather than a reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the representativeness of the collected topics and the assumption that rule-audited atomic criteria faithfully encode real consumer reactions; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Real social-media topics and rule-audited criteria can serve as ground truth for consumer reactions.
    This premise is required for the coverage percentages to be interpreted as measures of model fidelity to actual consumer behavior.

pith-pipeline@v0.9.0 · 5808 in / 1368 out tokens · 60868 ms · 2026-05-20T15:26:50.784714+00:00 · methodology

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 9 internal anchors

  1. [1]

    Social skill training with large language models.arXiv preprint arXiv:2404.04204, 2024

    Diyi Yang, Caleb Ziems, William Held, Omar Shaikh, Michael S Bernstein, and John Mitchell. Social skill training with large language models.arXiv preprint arXiv:2404.04204, 2024

  2. [2]

    AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

    Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. Agentsociety: Large-scale simulation of llm- driven generative agents advances understanding of human behaviors and society.arXiv preprint arXiv:2502.08691, 2025

  3. [3]

    Trendsim: Simulating trending topics in social media under poisoning attacks with llm-based multi-agent system

    Zeyu Zhang, Jianxun Lian, Chen Ma, Yaning Qu, Ye Luo, Lei Wang, Rui Li, Xu Chen, Yankai Lin, Le Wu, et al. Trendsim: Simulating trending topics in social media under poisoning attacks with llm-based multi-agent system. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2930–2949, 2025

  4. [4]

    User behavior simulation with large language model-based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

    Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, et al. User behavior simulation with large language model-based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

  5. [5]

    Llm-based multi-agent system for simulating and analyzing marketing and consumer behavior

    Man-Lin Chu et al. Llm-based multi-agent system for simulating and analyzing marketing and consumer behavior. InIEEE International Conference on e-Business Engineering (ICEBE), 2025

  6. [6]

    From individual to society: A survey on social simulation driven by large language model-based agents.arXiv preprint arXiv:2412.03563, 2024

    Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, et al. From individual to society: A survey on social simulation driven by large language model-based agents.arXiv preprint arXiv:2412.03563, 2024

  7. [7]

    Performance and biases of large language models in public opinion simulation.Humanities and Social Sciences Communications, 11(1):1–13, 2024

    Yao Qu and Jue Wang. Performance and biases of large language models in public opinion simulation.Humanities and Social Sciences Communications, 11(1):1–13, 2024

  8. [8]

    Llm-based human simulations have not yet been reliable.arXiv preprint arXiv:2501.08579, 2025

    Qian Wang, Jiaying Wu, Zichen Jiang, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Wei Chen, and Bingsheng He. Llm-based human simulations have not yet been reliable.arXiv preprint arXiv:2501.08579, 2025

  9. [9]

    Theory of mind in large language models: Assessment and enhancement

    Ruirui Chen, Weifeng Jiang, Chengwei Qin, and Cheston Tan. Theory of mind in large language models: Assessment and enhancement. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31539–31558, Vienna, A...

  10. [10]

    Recusersim: A realistic and diverse user simulator for evaluating conversational recommender systems

    Luyu Chen, Quanyu Dai, Zeyu Zhang, Xueyang Feng, Mingyu Zhang, Pengcheng Tang, Xu Chen, Yue Zhu, and Zhenhua Dong. Recusersim: A realistic and diverse user simulator for evaluating conversational recommender systems. InCompanion Proceedings of the ACM on Web Conference 2025, WWW ’25, page 133–142, New York, NY , USA, 2025. Association for Computing Machinery

  11. [11]

    Llms reproduce human purchase intent via semantic similarity elicitation of likert ratings.arXiv preprint arXiv:2510.08338, 2025

    Benjamin F Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C Luhmann, Robbie Dow, Kli Pappas, and Thomas V Wiecki. Llms reproduce human purchase intent via semantic similarity elicitation of likert ratings.arXiv preprint arXiv:2510.08338, 2025

  12. [12]

    Simulating public opinion: Comparing distributional and individual-level predictions from llms and random forests.Entropy, 27(9):923, 2025

    Fernando Miranda and Pedro Paulo Balbi. Simulating public opinion: Comparing distributional and individual-level predictions from llms and random forests.Entropy, 27(9):923, 2025

  13. [13]

    Ljubisa Bojic, Alexander Felfernig, Bojana Dinic, Velibor Ilic, Achim Rettinger, Vera Mevorah, and Damian Trilling. Llm agents predict social media reactions but do not outperform text classifiers: Benchmarking simulation accuracy using 120k+ personas of 1511 humans.arXiv preprint arXiv:2604.19787, 2026

  14. [14]

    Towards simulating social media users with llms: Evaluating the operational validity of conditioned comment prediction

    Nils Schwager, Simon Münker, Alistair Plum, and Achim Rettinger. Towards simulating social media users with llms: Evaluating the operational validity of conditioned comment prediction. In The Proceedings for the 15th Workshop on Computational Approaches to Subjectivity, Sentiment Social Media Analysis (WASSA 2026), pages 208–221, 2026. 10

  15. [15]

    Smp challenge: An overview and analysis of social media prediction challenge

    Bo Wu, Peiye Liu, Wen-Huang Cheng, Bei Liu, Zhaoyang Zeng, Jia Wang, Qiushi Huang, and Jiebo Luo. Smp challenge: An overview and analysis of social media prediction challenge. In Proceedings of the 31st ACM International Conference on Multimedia, 2023

  16. [16]

    SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

    Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, and Paul Röttger. Simbench: Benchmarking the ability of large language models to simulate human behaviors. arXiv preprint arXiv:2510.17516, 2025

  17. [17]

    Charactereval: A chinese benchmark for role-playing conversational agent evaluation

    Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11836–11850, 2024

  18. [18]

    Roleagent: Building, interacting, and benchmarking high-quality role-playing agents from scripts.Advances in Neural Information Processing Systems, 37:49403–49428, 2024

    Jiaheng Liu, Zehao Ni, Haoran Que, Tao Sun, Zekun Wang, Jian Yang, Jiakai Wang, Hongcheng Guo, Zhongyuan Peng, Ge Zhang, et al. Roleagent: Building, interacting, and benchmarking high-quality role-playing agents from scripts.Advances in Neural Information Processing Systems, 37:49403–49428, 2024

  19. [19]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-Judge.arXiv preprint arXiv:2411.15594, 2024

  20. [20]

    Humans or LLMs as the judge? a study on judgement bias

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? a study on judgement bias. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301–8327. Association for Computational Linguistics, 2024

  21. [21]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  22. [22]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  23. [23]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

  24. [24]

    Fantom: A benchmark for stress-testing machine theory of mind in interactions

    Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. Fantom: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023

  25. [25]

    Opentom: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models

    Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, and Yulan He. Opentom: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8593–8623, 2024

  26. [26]

    O’Brien, Carrie J

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23. Association for Computing Machinery, 2023

  27. [27]

    Arriaga, and Adam Tauman Kalai

    Gati V Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings o...

  28. [28]

    Large language models as psychological simulators: A methodological guide

    Zhicheng Lin. Large language models as psychological simulators: A methodological guide. Advances in Methods and Practices in Psychological Science, 9(1):25152459251410153, 2026

  29. [29]

    A cross-cultural comparison of llm-based public opinion simulation: Evaluating chinese and us models on diverse societies

    Weihong Qi, Fan Huang, Jisun An, and Haewoon Kwak. A cross-cultural comparison of llm-based public opinion simulation: Evaluating chinese and us models on diverse societies. arXiv preprint arXiv:2506.21587, 2025

  30. [30]

    Extracting consumer insight from text: A large language model approach to emotion and evaluation measurement.arXiv preprint arXiv:2602.15312, 2026

    Stephan Ludwig, Peter J Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin, Dhruv Grewal, and Lan Du. Extracting consumer insight from text: A large language model approach to emotion and evaluation measurement.arXiv preprint arXiv:2602.15312, 2026

  31. [31]

    Large language models for market research: A data-augmentation approach.Marketing Science, 2026

    Mengxin Wang, Dennis J Zhang, and Heng Zhang. Large language models for market research: A data-augmentation approach.Marketing Science, 2026

  32. [32]

    Predicting results of social science experiments using large language models.Preprint, 2024

    Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. Predicting results of social science experiments using large language models.Preprint, 2024

  33. [33]

    R., Liu, R., Richardson, S

    Jacy Reese Anthis, Ryan Liu, Sean M Richardson, Austin C Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, and Michael Bernstein. Llm social simulations are a promising research method.arXiv preprint arXiv:2504.02234, 2025

  34. [34]

    Assessing social align- ment: Do personality-prompted large language models behave like humans?arXiv preprint arXiv:2412.16772, 2024

    Ivan Zakazov, Mikolaj Boronski, Lorenzo Drudi, and Robert West. Assessing social align- ment: Do personality-prompted large language models behave like humans?arXiv preprint arXiv:2412.16772, 2024

  35. [35]

    Evaluating the ability of large language models to predict human social decisions.Scientific Reports, 15(1):32290, 2025

    Feng Xiao and XT XiaoTian Wang. Evaluating the ability of large language models to predict human social decisions.Scientific Reports, 15(1):32290, 2025

  36. [36]

    Using large language models to simulate human behavioural experiments: Port of mars.arXiv preprint arXiv:2506.05555, 2025

    Oliver Slumbers, Joel Z Leibo, and Marco A Janssen. Using large language models to simulate human behavioural experiments: Port of mars.arXiv preprint arXiv:2506.05555, 2025

  37. [37]

    Oasis: Open agents social interaction simulations on one million agents

    Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, et al. Oasis: Open agent social interaction simulations with one million agents.arXiv preprint arXiv:2411.11581, 2024

  38. [38]

    Sotopia: Interactive evaluation for social intelligence in language agents

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis- Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents.arXiv preprint arXiv:2310.11667, 2023

  39. [39]

    Think socially via cognitive reasoning.arXiv preprint arXiv:2509.22546, 2025

    Jinfeng Zhou, Zheyu Chen, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, and Minlie Huang. Think socially via cognitive reasoning.arXiv preprint arXiv:2509.22546, 2025

  40. [40]

    Infusing Theory of Mind into Socially Intelligent LLM Agents

    EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, and Vered Shwartz. Infusing theory of mind into socially intelligent llm agents.arXiv preprint arXiv:2509.22887, 2025

  41. [41]

    A foundation model to predict and capture human cognition.Nature, 644(8078):1002–1009, 2025

    Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian Coda- Forno, Peter Dayan, Can Demircan, Maria K Eckstein, Noémi Éltet˝o, et al. A foundation model to predict and capture human cognition.Nature, 644(8078):1002–1009, 2025

  42. [42]

    Customer-r1: Personal- ized simulation of human behaviors via rl-based llm agent in online shopping.arXiv preprint arXiv:2510.07230, 2025

    Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Customer-r1: Personal- ized simulation of human behaviors via rl-based llm agent in online shopping.arXiv preprint arXiv:2510.07230, 2025

  43. [43]

    Customer-r1: per- sonalized simulation of human behaviors via rl-based llm agent in online shopping

    Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Customer-r1: per- sonalized simulation of human behaviors via rl-based llm agent in online shopping. InFirst Workshop on Multi-Turn Interactions in Large Language Models

  44. [44]

    Using llms for market research.Harvard Business School Marketing Unit Working Paper, (23-062), 2023

    James Brand, Ayelet Israeli, and Donald Ngwe. Using llms for market research.Harvard Business School Marketing Unit Working Paper, (23-062), 2023

  45. [45]

    The silicon sample: Benchmarking synthetic users against human respondents in market research.Available at SSRN 5835122, 2025

    Arya Agarwal. The silicon sample: Benchmarking synthetic users against human respondents in market research.Available at SSRN 5835122, 2025. 12

  46. [46]

    Twinmarket: A scalable behavioral and social simulation for financial markets.arXiv preprint arXiv:2502.01506, 2025

    Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets.arXiv preprint arXiv:2502.01506, 2025

  47. [47]

    Econagent: large language model-empowered agents for simulating macroeconomic activities

    Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: large language model-empowered agents for simulating macroeconomic activities. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15523–15536, 2024

  48. [48]

    Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan Suchow, Zhenyu Cui, Rong Liu, et al. Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making.Advances in Neural Information Processing Systems, 37:137010–137045, 2024

  49. [49]

    Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

    Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, and Leo Huang. Sell more, play less: Bench- marking llm realistic selling skill.arXiv preprint arXiv:2604.07054, 2026

  50. [50]

    Debate: A large-scale benchmark for role-playing llm agents in multi-agent, long-form debates.arXiv preprint arXiv:2510.25110, 2025

    Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, et al. Debate: A large-scale benchmark for role-playing llm agents in multi-agent, long-form debates.arXiv preprint arXiv:2510.25110, 2025

  51. [51]

    Benchmark- ing overton pluralism in llms.arXiv preprint arXiv:2512.01351, 2025

    Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, and Michiel A Bakker. Benchmark- ing overton pluralism in llms.arXiv preprint arXiv:2512.01351, 2025

  52. [52]

    Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao. Simulatorarena: Are user simulators reliable proxies for multi-turn evaluation of ai assistants? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35200–35278, 2025

  53. [53]

    Lost in simulation: Llm-simulated users are unreliable proxies for human users in agentic evaluations

    Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, and Seraphina Goldfarb-Tarrant. Lost in simulation: Llm-simulated users are unreliable proxies for human users in agentic evaluations. arXiv preprint arXiv:2601.17087, 2026

  54. [54]

    War and peace (waragent): Large language model-based multi-agent simulation of world wars

    Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227, 2023

  55. [55]

    S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents

    Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023

  56. [56]

    Gensim: A general social simulation platform with large language model based agents

    Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, et al. Gensim: A general social simulation platform with large language model based agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolo...

  57. [57]

    Socioverse: A world model for social simulation powered by llm agents and a pool of 10 million real-world users

    Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, et al. Socioverse: A world model for social simulation powered by llm agents and a pool of 10 million real-world users. arXiv preprint arXiv:2504.10157, 2025

  58. [58]

    Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In ICLR, 2024

  59. [59]

    Megaagent: A large-scale autonomous llm-based multi-agent system without predefined sops

    Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, and Bingsheng He. Megaagent: A large-scale autonomous llm-based multi-agent system without predefined sops. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036, 2025

  60. [60]

    Can ai automatically analyze public opinion? a llm agents-based agentic pipeline for timely public opinion analysis

    Jing Liu, Xinxing Ren, Yanmeng Xu, and Zekun Guo. Can ai automatically analyze public opinion? a llm agents-based agentic pipeline for timely public opinion analysis. arXiv preprint arXiv:2505.11401, 2025

  61. [61]

    Large language models empowered agent-based modeling and simulation: A survey and perspectives

    Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024

  62. [62]

    From persona to personalization: A survey on role-playing language agents

    Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, et al. From persona to personalization: A survey on role-playing language agents. Transactions on Machine Learning Research

  63. [63]

    Synthetic replacements for human survey data? the perils of large language models

    James Bisbee, Joshua D Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M Larson. Synthetic replacements for human survey data? the perils of large language models. Political Analysis, 32(4):401–416, 2024

  64. [64]

    Social iqa: Commonsense reasoning about social interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, 2019

  65. [65]

    Do llms have distinct and consistent personality? trait: Personality testset designed for llms with psychometrics

    Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, et al. Do llms have distinct and consistent personality? trait: Personality testset designed for llms with psychometrics. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8397–8437, 2025

  66. [66]

    Psychcounsel-bench: Evaluating the psychology intelligence of large language models

    Min Zeng. Psychcounsel-bench: Evaluating the psychology intelligence of large language models. arXiv preprint arXiv:2510.01611, 2025

  67. [67]

    Socialbench: Sociality evaluation of role-playing conversational agents

    Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Gao Xing, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, and Fei Huang. Socialbench: Sociality evaluation of role-playing conversational agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2108–2126, 2024

  68. [68]

    The greatest good benchmark: Measuring llms’ alignment with utilitarian moral dilemmas

    Giovanni Franco Gabriel Marraffini, Andrés Cotton, Noe Fabian Hsueh, Axel Fridman, Juan Wisznia, and Luciano Del Corro. The greatest good benchmark: Measuring llms’ alignment with utilitarian moral dilemmas. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21950–21959, 2024

  69. [69]

    Social-r1: Enhancing social intelligence in LLMs through human-like reinforced reasoning

    Anonymous. Social-r1: Enhancing social intelligence in LLMs through human-like reinforced reasoning. Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review

  70. [70]

    Motivebench: How far are we from human-like motivational reasoning in large language models?

    Xixian Yong, Jianxun Lian, Xiaoyuan Yi, Xiao Zhou, and Xing Xie. Motivebench: How far are we from human-like motivational reasoning in large language models? arXiv preprint arXiv:2506.13065, 2025

  71. [71]

    Emobench: Evaluating the emotional intelligence of large language models

    Sahand Sabour, Siyang Liu, Zheyuan Zhang, June Liu, Jinfeng Zhou, Alvionna Sunaryo, Tatia Lee, Rada Mihalcea, and Minlie Huang. Emobench: Evaluating the emotional intelligence of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5986–6004, 2024

  72. [72]

    Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models

    Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, et al. Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922, 2025

  73. [73]

    Aligning AI with shared human values

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. In International Conference on Learning Representations, 2021

  74. [74]

    Character-llm: A trainable agent for role-playing

    Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158, 2023

  75. [75]

    Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews

    Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1840–1873, 2024

  76. [76]

    PersonaGym: Evaluating persona agents and LLMs

    Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik R Narasimhan, and Vishvak Murahari. PersonaGym: Evaluating persona agents and LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMN...

  77. [77]

    Evaluating the efficacy of llms to emulate realistic human personalities

    Lawrence J. Klinkert, Steph Buongiorno, and Corey Clark. Evaluating the efficacy of llms to emulate realistic human personalities. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 20(1):65–75, Nov. 2024

  78. [78]

    Human behavior atlas: Benchmarking unified psychological and social behavior understanding

    Keane Ong, Wei Dai, Carol Li, Dewei Feng, Hengzhi Li, Jingyao Wu, Jiaee Cheong, Rui Mao, Gianmarco Mengaldo, Erik Cambria, et al. Human behavior atlas: Benchmarking unified psychological and social behavior understanding. arXiv preprint arXiv:2510.04899, 2025

  79. [79]

    AgentSense: Benchmarking social intelligence of language agents through interactive scenarios

    Xinyi Mou, Jingcong Liang, Jiayu Lin, Xinnong Zhang, Xiawei Liu, Shiyue Yang, Rong Ye, Lei Chen, Haoyu Kuang, Xuanjing Huang, and Zhongyu Wei. AgentSense: Benchmarking social intelligence of language agents through interactive scenarios. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas C...

  80. [80]

    Noah Wang, Z.y. Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, and Junran Peng. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar...
