Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
Pith reviewed 2026-05-20 15:26 UTC · model grok-4.3
The pith
Even the best LLMs reconstruct fewer than half of the specific reactions real consumers voice on social media.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConsumerSimBench demonstrates that frontier LLMs achieve at most 47.8 percent coverage of the atomic, rule-audited reaction criteria drawn from real Chinese social-media discourse, establishing that technical benchmark strength does not translate into faithful reconstruction of crowd-level consumer responses.
What carries the argument
ConsumerSimBench, which converts open reaction reconstruction into a set of auditable binary decisions over 23,122 atomic criteria extracted from 1,553 real topics and four reaction families.
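As a rough illustration of how binary decisions over atomic criteria can be turned into a coverage score, the sketch below aggregates yes/no judgments in two ways. The record layout, the function name `coverage`, and the toy data are assumptions made for illustration; the page does not say whether the paper averages over criteria (micro) or over topics (macro), so both are shown.

```python
from collections import defaultdict


def coverage(decisions):
    """Aggregate binary criterion decisions into coverage scores.

    decisions: iterable of (topic_id, criterion_id, covered) tuples, where
    `covered` is True if the model output was judged to hit that criterion.
    This layout is a hypothetical one, not the benchmark's actual schema.
    """
    decisions = list(decisions)
    if not decisions:
        raise ValueError("no decisions to score")

    # Micro-average: every criterion counts equally.
    micro = sum(1 for _, _, covered in decisions if covered) / len(decisions)

    # Macro-average: every topic counts equally, whatever its criterion count.
    per_topic = defaultdict(list)
    for topic_id, _, covered in decisions:
        per_topic[topic_id].append(covered)
    macro = sum(sum(v) / len(v) for v in per_topic.values()) / len(per_topic)

    return micro, macro


# Toy data: topic t1 has 4 criteria (1 covered), topic t2 has 2 (both covered).
toy = [
    ("t1", "c1", True), ("t1", "c2", False),
    ("t1", "c3", False), ("t1", "c4", False),
    ("t2", "c5", True), ("t2", "c6", True),
]
micro, macro = coverage(toy)
```

The two averages can diverge when topics carry very different numbers of criteria, which is one reason a reported coverage figure should state which one it is.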
If this is right
- LLMs remain unreliable for predicting the specific elements consumers will highlight in public discussions.
- Performance on standard technical benchmarks does not forecast success at socially grounded consumer simulation.
- Structured reasoning prompts can lower coverage of real reaction criteria.
- Multi-agent generate-and-reflect pipelines produce modest gains on subsets of the benchmark.
Where Pith is reading between the lines
- The same decomposition method could expose comparable shortfalls when applied to consumer discourse in other languages or platforms.
- Training or fine-tuning on large collections of annotated real-world reactions may be necessary to close the observed gap.
- Any production system that uses LLMs for audience testing should add direct checks against recorded public discourse patterns.
Load-bearing premise
The 23,122 atomic criteria accurately and completely represent the reaction patterns that real consumers surface in public discourse on these topics.
What would settle it
A fresh set of human annotations on a new collection of topics that shows model coverage rates rising well above 50 percent or staying consistently below it.
Original abstract
LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
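The agreement figures quoted in the abstract can be made concrete with a small sketch. It assumes, without confirmation from the paper, that "three-judge agreement" means the share of items on which all three judges give the same yes/no label, and that the comparison with humans uses the judges' majority vote. All function names and the toy votes are hypothetical.

```python
def unanimous_rate(votes):
    """Share of items on which every judge gave the same label.

    votes: list of tuples of bools, one tuple per criterion.
    """
    return sum(1 for v in votes if len(set(v)) == 1) / len(votes)


def majority(v):
    """Majority label of an odd-sized tuple of bool votes."""
    return sum(v) * 2 > len(v)


def agreement_with_reference(votes, reference):
    """Share of items where the judges' majority matches a reference label."""
    hits = sum(1 for v, r in zip(votes, reference) if majority(v) == r)
    return hits / len(votes)


# Toy data: four criteria, three judges each, plus a human-majority label.
toy_votes = [
    (True, True, True),
    (True, False, True),
    (False, False, False),
    (False, True, False),
]
toy_human = [True, True, False, True]

judge_agreement = unanimous_rate(toy_votes)
human_agreement = agreement_with_reference(toy_votes, toy_human)
```

Other definitions, such as mean pairwise agreement, would give different numbers on the same votes, so the definition matters when comparing the 65.8% and 92.1% figures.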
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ConsumerSimBench, a benchmark for evaluating LLMs on reconstructing crowd-level consumer reactions from 1,553 real Chinese social-media topics. It decomposes reactions into 23,122 atomic, rule-audited criteria across four families and scores models via pointwise yes-no decisions rather than holistic judgments. This yields three-judge agreement of 92.1% (up from 65.8%) with 98.4% alignment to human-majority labels. Across 13 frontier models, Gemini-3.1-Pro covers only 47.8% of criteria while GPT-5.2 and Claude-4.6 perform worse despite strong technical benchmark results; structured reasoning prompts decrease coverage while a generate-reflect pipeline modestly improves one model. The work frames consumer simulation as forecasting over real public-discourse reactions.
Significance. If the criteria faithfully and exhaustively capture observable reactions, the results demonstrate a clear gap between frontier LLMs' technical capabilities and their ability to anticipate concrete consumer concerns in high-context settings. This has direct implications for marketing pre-testing, opinion simulation, and AI systems that must align with diverse human preferences. The shift to auditable pointwise decisions and grounding in external social-media data is a methodological advance over preference-based judges.
major comments (2)
- §3 (Benchmark Construction): The 23,122 criteria are presented as rule-audited and comprehensive, yet the manuscript provides no inter-auditor reliability statistics, no quantification of reactions discarded by the auditing rules, and no held-out coverage check against additional posts. The assumption that the criteria are reliable and complete is load-bearing for the headline 47.8% coverage result and the claim of a gap between technical and consumer-intuition performance.
- Results §5.2 and Table 2: The 47.8% coverage for Gemini-3.1-Pro is reported without statistical tests or confidence intervals on the per-criterion coverage rates; given the large number of criteria, small differences in auditing rules could materially shift the ranking and the gap to other models.
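One way to attach an interval to a coverage rate is a percentile bootstrap. The sketch below resamples whole topics rather than individual criteria, on the assumption (not stated in the paper) that criteria from the same topic are correlated. The function name `bootstrap_ci`, its parameters, and the toy data are illustrative only.

```python
import random


def bootstrap_ci(per_topic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for micro-averaged coverage.

    per_topic: dict mapping topic_id -> non-empty list of bool decisions.
    Topics are resampled with replacement (a cluster bootstrap).
    """
    rng = random.Random(seed)
    topics = list(per_topic)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(topics) for _ in topics]
        hits = sum(sum(per_topic[t]) for t in sample)
        total = sum(len(per_topic[t]) for t in sample)
        stats.append(hits / total)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy data: three topics with different coverage rates.
toy_topics = {
    "t1": [True, False, False, False],
    "t2": [True, True],
    "t3": [False, True, False],
}
ci_low, ci_high = bootstrap_ci(toy_topics)

# Degenerate check: identical topics leave no sampling variation.
flat_low, flat_high = bootstrap_ci({"a": [True, False], "b": [True, False]})
```

With only three toy topics the interval is very wide; the point is the resampling unit, not the numbers.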
minor comments (2)
- [§3] The four reaction families are introduced without a clear taxonomy or example criteria in the main text; moving one or two concrete examples to the main body would improve readability.
- [Figure 3] Figure 3 (prompting ablation) would benefit from reporting the exact subset size and variance across runs for the MiMo-V2.5-Pro improvement from 32.9% to 37.6%.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: §3 (Benchmark Construction): The 23,122 criteria are presented as rule-audited and comprehensive, yet the manuscript provides no inter-auditor reliability statistics, no quantification of reactions discarded by the auditing rules, and no held-out coverage check against additional posts. This assumption is load-bearing for the headline 47.8% coverage result and the claim of a gap between technical and consumer-intuition performance.
Authors: We agree that these validation details are important for establishing the robustness of the benchmark. In the revised manuscript we will add inter-auditor reliability statistics (e.g., Cohen’s kappa) for the rule-auditing process. We will also report the number and proportion of candidate reactions discarded by the auditing rules together with the primary reasons for exclusion. Finally, we will include a held-out coverage analysis on an additional set of posts drawn from the same source distribution and report the resulting coverage figures. These additions will appear in the updated §3.
Revision: yes
- Referee: Results §5.2 and Table 2: The 47.8% coverage for Gemini-3.1-Pro is reported without statistical tests or confidence intervals on the per-criterion coverage rates; given the large number of criteria, small differences in auditing rules could materially shift the ranking and the gap to other models.
Authors: We concur that statistical support is needed to substantiate the reported coverage figures and model rankings. In the revision we will augment §5.2 and Table 2 with bootstrap-derived confidence intervals for each model’s coverage rate and will include pairwise statistical comparisons (e.g., McNemar’s test on per-criterion decisions) to assess whether observed differences are significant. These analyses will be added to the results section and the table.
Revision: yes
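The two statistics the simulated rebuttal names, Cohen's kappa and McNemar's test, can be sketched in plain Python. This is a minimal illustration under stated assumptions: two auditors giving binary keep/discard labels, two models scored on the same criteria, and the exact binomial form of McNemar's test. The function names and toy data are hypothetical, and returning 1.0 for the undefined constant-rater case is a convention chosen here, not something the paper specifies.

```python
from math import comb


def cohens_kappa(a, b):
    """Cohen's kappa for two raters giving binary labels to the same items."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # each rater's "yes" rate
    pe = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    if pe == 1.0:
        return 1.0  # kappa is undefined here; 1.0 is a convention (assumption)
    return (po - pe) / (1 - pe)


def mcnemar_exact(model_a, model_b):
    """Two-sided exact McNemar p-value on paired binary decisions."""
    b = sum(1 for x, y in zip(model_a, model_b) if x and not y)  # only A covers
    c = sum(1 for x, y in zip(model_a, model_b) if y and not x)  # only B covers
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, nothing to distinguish the models
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy auditors: agree on 3 of 4 items.
kappa = cohens_kappa([True, True, False, False], [True, False, False, False])

# Toy models: A covers five criteria that B misses, B covers none that A misses.
a_hits = [True] * 5 + [True, False]
b_hits = [False] * 5 + [True, False]
p_value = mcnemar_exact(a_hits, b_hits)
```

McNemar's test uses only the discordant pairs, which is why it suits per-criterion comparisons where both models are scored against the same criteria.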
Circularity Check
No circularity: benchmark anchored in external social-media data and human labels
The paper constructs ConsumerSimBench directly from 1,553 real Chinese social-media topics, decomposes them into 23,122 atomic rule-audited criteria across four reaction families, and measures model coverage against these externally sourced points. Validation reports 98.4% agreement between pointwise judge decisions and human-majority labels, with no fitted parameters, self-defined quantities, or self-citation chains invoked to justify the criteria or the 47.8% coverage result. The central claim is an empirical comparison against independent real-world discourse rather than a reduction to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Real social-media topics and rule-audited criteria can serve as ground truth for consumer reactions.