Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Jiajun Li; Jianghao Lin; Tianyu Wang

arXiv: 2605.17079 · v1 · pith:2C3V4EBR · submitted 2026-05-16 · cs.CL · cs.AI · cs.CY

Pith reviewed 2026-05-20 15:26 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CY

keywords LLM evaluation · consumer simulation · social media · public opinion · benchmark · reaction reconstruction · Chinese discourse

The pith

Even the best LLMs reconstruct fewer than half of the specific reactions real consumers voice on social media.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark that tests whether LLMs can reconstruct the concrete reaction patterns real people express in public discussions. It starts from 1,553 real Chinese social-media topics and decomposes the reactions observed on each topic into small, rule-checked yes-no questions (23,122 in total, about 15 per topic on average) rather than one overall judgment. Across thirteen frontier models, the highest coverage reaches only 47.8 percent of these real criteria, showing that strong results on technical tests do not guarantee accurate prediction of what consumers will actually notice or care about. The work treats consumer simulation as a forecasting task against recorded public discourse instead of abstract preference scoring. If the gap holds, many current uses of LLMs for marketing pre-tests or opinion polling rest on an unproven assumption of social fidelity.

Core claim

ConsumerSimBench demonstrates that frontier LLMs achieve at most 47.8 percent coverage of the atomic, rule-audited reaction criteria drawn from real Chinese social-media discourse, establishing that technical benchmark strength does not translate into faithful reconstruction of crowd-level consumer responses.

What carries the argument

ConsumerSimBench, which converts open-ended reaction reconstruction into auditable binary decisions over 23,122 atomic criteria, extracted from 1,553 real topics and grouped into four reaction families.
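The scoring idea can be illustrated with a minimal sketch. It assumes each criterion carries a family label and a yes/no coverage decision, and that the final score is the unweighted mean of per-family coverage rates, as the Figure 2 caption suggests ("equal weight to the four reaction families"). The family identifiers, the function name, and the choice to aggregate over all criteria at once (rather than per topic first) are illustrative assumptions, not details confirmed by the paper.

```python
from collections import defaultdict

# Hypothetical identifiers for the four reaction families named in the paper
# (flashpoints, emotion, praise, critique).
FAMILIES = ("flashpoint", "emotion", "praise", "critique")


def family_balanced_coverage(decisions):
    """Score a list of (family, covered) pairs, one pair per atomic criterion.

    Returns (score, per_family), where score is the unweighted mean of the
    per-family coverage rates. Families with no criteria are skipped, which
    is an assumption; the paper may handle that case differently.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for family, covered in decisions:
        totals[family] += 1
        hits[family] += int(bool(covered))
    per_family = {f: hits[f] / totals[f] for f in FAMILIES if totals[f]}
    if not per_family:
        raise ValueError("no criteria supplied")
    score = sum(per_family.values()) / len(per_family)
    return score, per_family


# Toy decisions for one generated comment bundle.
example = [
    ("flashpoint", True), ("flashpoint", False),
    ("emotion", True),
    ("praise", False), ("praise", False),
    ("critique", True), ("critique", True), ("critique", False), ("critique", True),
]
score, per_family = family_balanced_coverage(example)
print(per_family)  # per-family coverage rates
print(score)       # 0.5625, versus 5/9 if all criteria were pooled
```

The toy case shows why the weighting matters: a family with few criteria (here, emotion) moves the balanced score as much as a family with many.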

If this is right

LLMs remain unreliable for predicting the specific elements consumers will highlight in public discussions.
Performance on standard technical benchmarks does not forecast success at socially grounded consumer simulation.
Structured reasoning prompts can lower coverage of real reaction criteria.
A multi-agent generate-and-reflect pipeline produces a modest gain on a subset of the benchmark (32.9% to 37.6% for MiMo-V2.5-Pro).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition method could expose comparable shortfalls when applied to consumer discourse in other languages or platforms.
Training or fine-tuning on large collections of annotated real-world reactions may be necessary to close the observed gap.
Any production system that uses LLMs for audience testing should add direct checks against recorded public discourse patterns.

Load-bearing premise

The 23,122 atomic criteria accurately and completely represent the reaction patterns that real consumers surface in public discourse on these topics.

What would settle it

A fresh set of human annotations on a new collection of topics that shows model coverage rates rising well above 50 percent or staying consistently below it.

Figures

Figures reproduced from arXiv: 2605.17079 by Jiajun Li, Jianghao Lin, Tianyu Wang.

**Figure 1.** Overview of ConsumerSimBench. Left: The 1,553 trending topics span four supercategories and twenty consumer-facing sub-fields. Right: One task instance. Given a real trending topic, a generator must produce a free-form bundle of consumer comments that collectively cover an audited set of atomic criteria across four reaction families (flashpoints, emotion, praise, critique).

**Figure 2.** ConsumerSimBench construction and evaluation pipeline. Public trend signals are curated into topic–event records and abstracted observed reactions; these reactions are converted into four families of atomic reaction criteria, hardened, and finally used for pointwise coverage judging of model-generated comments. The final score gives equal weight to the four reaction families.

**Figure 3.** Main results. Left: overall leaderboard on the final full benchmark. Right: section-wise…

**Figure 4.** Ranked sub-field gap plot across 20 consumer-facing domains. Rows are sorted by Gemini-3.1-Pro score; horizontal spans show the cross-model range.

**Figure 5.** Direct SCF prompting does not improve coverage. Arrows show the change from naive prompting (blue) to ConsumerSCF (red). We classify all 197,790 missed criteria across the 13 generators by parsing judge explanations into an 8-category taxonomy (Appendix B; the categorization is derived from keyword patterns in judge text plus a 50-sample manual audit). Using this taxonomy, we find that missing-content pat…

**Figure 6.** Second-platform YouTube pilot. Left: overall lollipop leaderboard on 100 English trending…
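As a rough consistency check on the numbers quoted above, the 197,790 missed criteria can be set against the total number of pointwise decisions. The sketch below assumes that each of the 13 generators is judged on all 23,122 criteria; the source does not state this explicitly, and the pooled rate it yields is not the family-balanced leaderboard score.

```python
# Figures quoted in the source.
N_CRITERIA = 23_122    # atomic criteria in the benchmark
N_GENERATORS = 13      # frontier generators evaluated
N_MISSED = 197_790     # missed criteria across all generators

# Assumption: every generator is judged on every criterion.
total_decisions = N_CRITERIA * N_GENERATORS
missed_rate = N_MISSED / total_decisions
pooled_coverage = 1 - missed_rate

print(f"{total_decisions:,} decisions, {missed_rate:.1%} missed, "
      f"{pooled_coverage:.1%} covered")
# prints "300,586 decisions, 65.8% missed, 34.2% covered"
```

Under that assumption the pooled coverage across all generators is about 34%, which is consistent with a leaderboard whose best entry reaches 47.8%.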

Original abstract

LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
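The two agreement figures in the abstract can be made concrete with a small sketch. The abstract does not say how "three-judge agreement" is defined, so the code assumes it means the share of criteria on which all three judges return the same label, and that judge-versus-human agreement compares one judge decision with the majority of human labels. The function names and toy data are hypothetical.

```python
from collections import Counter


def majority(votes):
    """Most common label among an odd number of binary votes."""
    return Counter(votes).most_common(1)[0][0]


def unanimous_rate(judge_votes):
    """Share of criteria on which every judge returns the same label."""
    return sum(len(set(v)) == 1 for v in judge_votes) / len(judge_votes)


def judge_vs_human_rate(judge_decisions, human_votes):
    """Share of criteria where the judge decision matches the human majority."""
    matches = sum(
        d == majority(h) for d, h in zip(judge_decisions, human_votes)
    )
    return matches / len(judge_decisions)


# Toy data: 1 = criterion covered, 0 = not covered.
judge_votes = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
judge_decisions = [1, 1, 0, 1]
human_votes = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 1)]

print(unanimous_rate(judge_votes))                        # 0.5
print(judge_vs_human_rate(judge_decisions, human_votes))  # 0.75
```

A pairwise or chance-corrected definition would give different numbers on the same votes, so the definition should be fixed before comparing the holistic and pointwise settings.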

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ConsumerSimBench, a benchmark for evaluating LLMs on reconstructing crowd-level consumer reactions from 1,553 real Chinese social-media topics. It decomposes reactions into 23,122 atomic, rule-audited criteria across four families and scores models via pointwise yes-no decisions rather than holistic judgments. This yields three-judge agreement of 92.1% (up from 65.8%) with 98.4% alignment to human-majority labels. Across 13 frontier models, Gemini-3.1-Pro covers only 47.8% of criteria while GPT-5.2 and Claude-4.6 perform worse despite strong technical benchmark results; structured reasoning prompts decrease coverage while a generate-reflect pipeline modestly improves one model. The work frames consumer simulation as forecasting over real public-discourse reactions.

Significance. If the criteria faithfully and exhaustively capture observable reactions, the results demonstrate a clear gap between frontier LLMs' technical capabilities and their ability to anticipate concrete consumer concerns in high-context settings. This has direct implications for marketing pre-testing, opinion simulation, and AI systems that must align with diverse human preferences. The shift to auditable pointwise decisions and grounding in external social-media data is a methodological advance over preference-based judges.

major comments (2)

[§3 (Benchmark Construction)] The 23,122 criteria are presented as rule-audited and comprehensive, yet the manuscript provides no inter-auditor reliability statistics, no quantification of reactions discarded by the auditing rules, and no held-out coverage check against additional posts. This assumption is load-bearing for the headline 47.8% coverage result and the claim of a gap between technical and consumer-intuition performance.
[Results §5.2 and Table 2] The 47.8% coverage for Gemini-3.1-Pro is reported without statistical tests or confidence intervals on the per-criterion coverage rates; given the large number of criteria, small differences in auditing rules could materially shift the ranking and the gap to other models.
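The inter-auditor reliability statistic requested in the first major comment could take the form of Cohen's kappa over two auditors' keep/discard decisions. The following is a minimal pure-Python sketch under that assumption; the paper does not describe its auditing workflow in these terms, and the labels and data are invented for illustration.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy audit: 1 = keep candidate criterion, 0 = discard.
auditor_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
auditor_b = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
kappa = cohens_kappa(auditor_a, auditor_b)
print(round(kappa, 3))  # 0.6
```

Here the auditors agree on 8 of 10 items, but half of that agreement is expected by chance, which is why kappa (0.6) is lower than raw agreement (0.8).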

minor comments (2)

[§3] The four reaction families are introduced without a clear taxonomy or example criteria in the main text; moving one or two concrete examples to the main body would improve readability.
[Figure 3] The prompting ablation would benefit from reporting the exact subset size and variance across runs for the MiMo-V2.5-Pro improvement from 32.9% to 37.6%.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will incorporate to strengthen the manuscript.

Referee: [§3 (Benchmark Construction)] The 23,122 criteria are presented as rule-audited and comprehensive, yet the manuscript provides no inter-auditor reliability statistics, no quantification of reactions discarded by the auditing rules, and no held-out coverage check against additional posts. This assumption is load-bearing for the headline 47.8% coverage result and the claim of a gap between technical and consumer-intuition performance.

Authors: We agree that these validation details are important for establishing the robustness of the benchmark. In the revised manuscript we will add inter-auditor reliability statistics (e.g., Cohen’s kappa) for the rule-auditing process. We will also report the number and proportion of candidate reactions discarded by the auditing rules together with the primary reasons for exclusion. Finally, we will include a held-out coverage analysis on an additional set of posts drawn from the same source distribution and report the resulting coverage figures. These additions will appear in the updated §3. Revision: yes.

Referee: [Results §5.2 and Table 2] The 47.8% coverage for Gemini-3.1-Pro is reported without statistical tests or confidence intervals on the per-criterion coverage rates; given the large number of criteria, small differences in auditing rules could materially shift the ranking and the gap to other models.

Authors: We concur that statistical support is needed to substantiate the reported coverage figures and model rankings. In the revision we will augment §5.2 and Table 2 with bootstrap-derived confidence intervals for each model’s coverage rate and will include pairwise statistical comparisons (e.g., McNemar’s test on per-criterion decisions) to assess whether observed differences are significant. These analyses will be added to the results section and the table. Revision: yes.
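The two analyses promised in this response can be sketched in pure Python. The code assumes per-criterion 0/1 coverage outcomes for each model, a percentile bootstrap over criteria, and the exact (binomial) form of McNemar's test on discordant pairs. Function names and data are hypothetical. Resampling criteria independently ignores clustering within topics and the equal-family weighting, so a real analysis would probably resample topics or stratify by family.

```python
import math
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 coverage outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mcnemar_exact(model_a, model_b):
    """Two-sided exact McNemar p-value on paired 0/1 outcomes."""
    # Discordant pairs: criteria covered by exactly one of the two models.
    only_a = sum(1 for x, y in zip(model_a, model_b) if x and not y)
    only_b = sum(1 for x, y in zip(model_a, model_b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-criterion outcomes for two models on the same ten criteria.
model_a = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
model_b = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

lower, upper = bootstrap_ci(model_a)
p_value = mcnemar_exact(model_a, model_b)
print(lower, upper)
print(p_value)  # 0.125
```

In the toy data model A alone covers six criteria and model B alone covers one, giving an exact two-sided p-value of 0.125. With only seven discordant pairs, even a 6-to-1 split is not significant at the 5% level.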

Circularity Check

0 steps flagged

No circularity: benchmark anchored in external social-media data and human labels

The paper constructs ConsumerSimBench directly from 1,553 real Chinese social-media topics, decomposes them into 23,122 atomic rule-audited criteria across four reaction families, and measures model coverage against these externally sourced points. Validation reports 98.4% agreement between pointwise judge decisions and human-majority labels, with no fitted parameters, self-defined quantities, or self-citation chains invoked to justify the criteria or the 47.8% coverage result. The central claim is an empirical comparison against independent real-world discourse rather than a reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the collected topics and the assumption that rule-audited atomic criteria faithfully encode real consumer reactions; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Real social-media topics and rule-audited criteria can serve as ground truth for consumer reactions.
This premise is required for the coverage percentages to be interpreted as measures of model fidelity to actual consumer behavior.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 9 internal anchors

[1]

Social skill training with large language models.arXiv preprint arXiv:2404.04204, 2024

Diyi Yang, Caleb Ziems, William Held, Omar Shaikh, Michael S Bernstein, and John Mitchell. Social skill training with large language models.arXiv preprint arXiv:2404.04204, 2024

work page arXiv 2024
[2]

AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. Agentsociety: Large-scale simulation of llm- driven generative agents advances understanding of human behaviors and society.arXiv preprint arXiv:2502.08691, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Trendsim: Simulating trending topics in social media under poisoning attacks with llm-based multi-agent system

Zeyu Zhang, Jianxun Lian, Chen Ma, Yaning Qu, Ye Luo, Lei Wang, Rui Li, Xu Chen, Yankai Lin, Le Wu, et al. Trendsim: Simulating trending topics in social media under poisoning attacks with llm-based multi-agent system. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2930–2949, 2025

work page 2025
[4]

User behavior simulation with large language model-based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, et al. User behavior simulation with large language model-based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

work page 2025
[5]

Llm-based multi-agent system for simulating and analyzing marketing and consumer behavior

Man-Lin Chu et al. Llm-based multi-agent system for simulating and analyzing marketing and consumer behavior. InIEEE International Conference on e-Business Engineering (ICEBE), 2025

work page 2025
[6]

From individual to society: A survey on social simulation driven by large language model-based agents.arXiv preprint arXiv:2412.03563, 2024

Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, et al. From individual to society: A survey on social simulation driven by large language model-based agents.arXiv preprint arXiv:2412.03563, 2024

work page arXiv 2024
[7]

Performance and biases of large language models in public opinion simulation.Humanities and Social Sciences Communications, 11(1):1–13, 2024

Yao Qu and Jue Wang. Performance and biases of large language models in public opinion simulation.Humanities and Social Sciences Communications, 11(1):1–13, 2024

work page 2024
[8]

Llm-based human simulations have not yet been reliable.arXiv preprint arXiv:2501.08579, 2025

Qian Wang, Jiaying Wu, Zichen Jiang, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Wei Chen, and Bingsheng He. Llm-based human simulations have not yet been reliable.arXiv preprint arXiv:2501.08579, 2025

work page arXiv 2025
[9]

Theory of mind in large language models: Assessment and enhancement

Ruirui Chen, Weifeng Jiang, Chengwei Qin, and Cheston Tan. Theory of mind in large language models: Assessment and enhancement. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31539–31558, Vienna, A...

work page 2025
[10]

Recusersim: A realistic and diverse user simulator for evaluating conversational recommender systems

Luyu Chen, Quanyu Dai, Zeyu Zhang, Xueyang Feng, Mingyu Zhang, Pengcheng Tang, Xu Chen, Yue Zhu, and Zhenhua Dong. Recusersim: A realistic and diverse user simulator for evaluating conversational recommender systems. InCompanion Proceedings of the ACM on Web Conference 2025, WWW ’25, page 133–142, New York, NY , USA, 2025. Association for Computing Machinery

work page 2025
[11]

Llms reproduce human purchase intent via semantic similarity elicitation of likert ratings.arXiv preprint arXiv:2510.08338, 2025

Benjamin F Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C Luhmann, Robbie Dow, Kli Pappas, and Thomas V Wiecki. Llms reproduce human purchase intent via semantic similarity elicitation of likert ratings.arXiv preprint arXiv:2510.08338, 2025

work page arXiv 2025
[12]

Simulating public opinion: Comparing distributional and individual-level predictions from llms and random forests.Entropy, 27(9):923, 2025

Fernando Miranda and Pedro Paulo Balbi. Simulating public opinion: Comparing distributional and individual-level predictions from llms and random forests.Entropy, 27(9):923, 2025

work page 2025
[13]

Ljubisa Bojic, Alexander Felfernig, Bojana Dinic, Velibor Ilic, Achim Rettinger, Vera Mevorah, and Damian Trilling. Llm agents predict social media reactions but do not outperform text classifiers: Benchmarking simulation accuracy using 120k+ personas of 1511 humans.arXiv preprint arXiv:2604.19787, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

Towards simulating social media users with llms: Evaluating the operational validity of conditioned comment prediction

Nils Schwager, Simon Münker, Alistair Plum, and Achim Rettinger. Towards simulating social media users with llms: Evaluating the operational validity of conditioned comment prediction. In The Proceedings for the 15th Workshop on Computational Approaches to Subjectivity, Sentiment Social Media Analysis (WASSA 2026), pages 208–221, 2026. 10

work page 2026
[15]

Smp challenge: An overview and analysis of social media prediction challenge

Bo Wu, Peiye Liu, Wen-Huang Cheng, Bei Liu, Zhaoyang Zeng, Jia Wang, Qiushi Huang, and Jiebo Luo. Smp challenge: An overview and analysis of social media prediction challenge. In Proceedings of the 31st ACM International Conference on Multimedia, 2023

work page 2023
[16]

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, and Paul Röttger. Simbench: Benchmarking the ability of large language models to simulate human behaviors. arXiv preprint arXiv:2510.17516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Charactereval: A chinese benchmark for role-playing conversational agent evaluation

Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11836–11850, 2024

work page 2024
[18]

Roleagent: Building, interacting, and benchmarking high-quality role-playing agents from scripts.Advances in Neural Information Processing Systems, 37:49403–49428, 2024

Jiaheng Liu, Zehao Ni, Haoran Que, Tao Sun, Zekun Wang, Jian Yang, Jiakai Wang, Hongcheng Guo, Zhongyuan Peng, Ge Zhang, et al. Roleagent: Building, interacting, and benchmarking high-quality role-playing agents from scripts.Advances in Neural Information Processing Systems, 37:49403–49428, 2024

work page 2024
[19]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-Judge.arXiv preprint arXiv:2411.15594, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Humans or LLMs as the judge? a study on judgement bias

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? a study on judgement bias. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301–8327. Association for Computational Linguistics, 2024

work page 2024
[21]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

work page 2024
[22]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Fantom: A benchmark for stress-testing machine theory of mind in interactions

Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. Fantom: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023

work page 2023
[25]

Opentom: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models

Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, and Yulan He. Opentom: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8593–8623, 2024

work page 2024
[26]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23. Association for Computing Machinery, 2023

work page 2023
[27]

Arriaga, and Adam Tauman Kalai

Gati V Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings o...

work page 2023
[28]

Large language models as psychological simulators: A methodological guide

Zhicheng Lin. Large language models as psychological simulators: A methodological guide. Advances in Methods and Practices in Psychological Science, 9(1):25152459251410153, 2026

work page 2026
[29]

A cross-cultural comparison of llm-based public opinion simulation: Evaluating chinese and us models on diverse societies

Weihong Qi, Fan Huang, Jisun An, and Haewoon Kwak. A cross-cultural comparison of llm-based public opinion simulation: Evaluating chinese and us models on diverse societies. arXiv preprint arXiv:2506.21587, 2025

work page arXiv 2025
[30]

Extracting consumer insight from text: A large language model approach to emotion and evaluation measurement.arXiv preprint arXiv:2602.15312, 2026

Stephan Ludwig, Peter J Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin, Dhruv Grewal, and Lan Du. Extracting consumer insight from text: A large language model approach to emotion and evaluation measurement.arXiv preprint arXiv:2602.15312, 2026

work page arXiv 2026
[31]

Large language models for market research: A data-augmentation approach.Marketing Science, 2026

Mengxin Wang, Dennis J Zhang, and Heng Zhang. Large language models for market research: A data-augmentation approach.Marketing Science, 2026

work page 2026
[32]

Predicting results of social science experiments using large language models.Preprint, 2024

Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. Predicting results of social science experiments using large language models.Preprint, 2024

work page 2024
[33]

R., Liu, R., Richardson, S

Jacy Reese Anthis, Ryan Liu, Sean M Richardson, Austin C Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, and Michael Bernstein. Llm social simulations are a promising research method.arXiv preprint arXiv:2504.02234, 2025

work page arXiv 2025
[34]

Assessing social align- ment: Do personality-prompted large language models behave like humans?arXiv preprint arXiv:2412.16772, 2024

Ivan Zakazov, Mikolaj Boronski, Lorenzo Drudi, and Robert West. Assessing social align- ment: Do personality-prompted large language models behave like humans?arXiv preprint arXiv:2412.16772, 2024

work page arXiv 2024
[35]

Evaluating the ability of large language models to predict human social decisions.Scientific Reports, 15(1):32290, 2025

Feng Xiao and XT XiaoTian Wang. Evaluating the ability of large language models to predict human social decisions.Scientific Reports, 15(1):32290, 2025

work page 2025
[36]

Using large language models to simulate human behavioural experiments: Port of mars.arXiv preprint arXiv:2506.05555, 2025

Oliver Slumbers, Joel Z Leibo, and Marco A Janssen. Using large language models to simulate human behavioural experiments: Port of mars.arXiv preprint arXiv:2506.05555, 2025

work page arXiv 2025
[37]

Oasis: Open agents social interaction simulations on one million agents

Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, et al. Oasis: Open agent social interaction simulations with one million agents.arXiv preprint arXiv:2411.11581, 2024

work page arXiv 2024
[38]

Sotopia: Interactive evaluation for social intelligence in language agents

Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis- Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents.arXiv preprint arXiv:2310.11667, 2023

work page arXiv 2023
[39]

Think socially via cognitive reasoning.arXiv preprint arXiv:2509.22546, 2025

Jinfeng Zhou, Zheyu Chen, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, and Minlie Huang. Think socially via cognitive reasoning.arXiv preprint arXiv:2509.22546, 2025

work page arXiv 2025
[40]

Infusing Theory of Mind into Socially Intelligent LLM Agents

EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, and Vered Shwartz. Infusing theory of mind into socially intelligent llm agents.arXiv preprint arXiv:2509.22887, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

A foundation model to predict and capture human cognition.Nature, 644(8078):1002–1009, 2025

Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian Coda- Forno, Peter Dayan, Can Demircan, Maria K Eckstein, Noémi Éltet˝o, et al. A foundation model to predict and capture human cognition.Nature, 644(8078):1002–1009, 2025

work page 2025
[42]

Customer-r1: Personal- ized simulation of human behaviors via rl-based llm agent in online shopping.arXiv preprint arXiv:2510.07230, 2025

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Customer-r1: Personal- ized simulation of human behaviors via rl-based llm agent in online shopping.arXiv preprint arXiv:2510.07230, 2025

work page arXiv 2025
[43]

Customer-r1: per- sonalized simulation of human behaviors via rl-based llm agent in online shopping

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Customer-r1: per- sonalized simulation of human behaviors via rl-based llm agent in online shopping. InFirst Workshop on Multi-Turn Interactions in Large Language Models

work page
[44]

Using llms for market research.Harvard Business School Marketing Unit Working Paper, (23-062), 2023

James Brand, Ayelet Israeli, and Donald Ngwe. Using llms for market research.Harvard Business School Marketing Unit Working Paper, (23-062), 2023

work page 2023
[45]

The silicon sample: Benchmarking synthetic users against human respondents in market research.Available at SSRN 5835122, 2025

Arya Agarwal. The silicon sample: Benchmarking synthetic users against human respondents in market research.Available at SSRN 5835122, 2025. 12

work page 2025
[46]

Twinmarket: A scalable behavioral and social simulation for financial markets.arXiv preprint arXiv:2502.01506, 2025

Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets.arXiv preprint arXiv:2502.01506, 2025

work page arXiv 2025
[47]

Econagent: large language model-empowered agents for simulating macroeconomic activities

Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: large language model-empowered agents for simulating macroeconomic activities. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15523–15536, 2024

work page 2024
[48]

Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan Suchow, Zhenyu Cui, Rong Liu, et al. Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making.Advances in Neural Information Processing Systems, 37:137010–137045, 2024

work page 2024
[49]

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, and Leo Huang. Sell more, play less: Bench- marking llm realistic selling skill.arXiv preprint arXiv:2604.07054, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[50]

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, et al. Debate: A large-scale benchmark for role-playing llm agents in multi-agent, long-form debates. arXiv preprint arXiv:2510.25110, 2025.
[51]

Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, and Michiel A Bakker. Benchmarking overton pluralism in llms. arXiv preprint arXiv:2512.01351, 2025.
[52]

Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao. Simulatorarena: Are user simulators reliable proxies for multi-turn evaluation of ai assistants? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35200–35278, 2025.
[53]

Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, and Seraphina Goldfarb-Tarrant. Lost in simulation: Llm-simulated users are unreliable proxies for human users in agentic evaluations. arXiv preprint arXiv:2601.17087, 2026.
[54]

Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227, 2023.
[55]

Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023.
[56]

Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, et al. Gensim: A general social simulation platform with large language model based agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolo...
[57]

Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, et al. Socioverse: A world model for social simulation powered by llm agents and a pool of 10 million real-world users. arXiv preprint arXiv:2504.10157, 2025.
[58]

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In ICLR, 2024.
[59]

Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, and Bingsheng He. Megaagent: A large-scale autonomous llm-based multi-agent system without predefined sops. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036, 2025.
[60]

Jing Liu, Xinxing Ren, Yanmeng Xu, and Zekun Guo. Can ai automatically analyze public opinion? a llm agents-based agentic pipeline for timely public opinion analysis. arXiv preprint arXiv:2505.11401, 2025.
[61]

Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024.
[62]

Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, et al. From persona to personalization: A survey on role-playing language agents. Transactions on Machine Learning Research.
[63]

James Bisbee, Joshua D Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M Larson. Synthetic replacements for human survey data? the perils of large language models. Political Analysis, 32(4):401–416, 2024.
[64]

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, 2019.
[65]

Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, et al. Do llms have distinct and consistent personality? trait: Personality testset designed for llms with psychometrics. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8397–8437, 2025.
[66]

Min Zeng. Psychcounsel-bench: Evaluating the psychology intelligence of large language models. arXiv preprint arXiv:2510.01611, 2025.
[67]

Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Gao Xing, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, and Fei Huang. Socialbench: Sociality evaluation of role-playing conversational agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2108–2126, 2024.
[68]

Giovanni Franco Gabriel Marraffini, Andrés Cotton, Noe Fabian Hsueh, Axel Fridman, Juan Wisznia, and Luciano Del Corro. The greatest good benchmark: Measuring llms’ alignment with utilitarian moral dilemmas. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21950–21959, 2024.
[69]

Anonymous. Social-r1: Enhancing social intelligence in LLMs through human-like reinforced reasoning. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.
[70]

Xixian Yong, Jianxun Lian, Xiaoyuan Yi, Xiao Zhou, and Xing Xie. Motivebench: How far are we from human-like motivational reasoning in large language models? arXiv preprint arXiv:2506.13065, 2025.
[71]

Sahand Sabour, Siyang Liu, Zheyuan Zhang, June Liu, Jinfeng Zhou, Alvionna Sunaryo, Tatia Lee, Rada Mihalcea, and Minlie Huang. Emobench: Evaluating the emotional intelligence of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5986–6004, 2024.
[72]

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, et al. Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922, 2025.
[73]

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. In International Conference on Learning Representations, 2021.
[74]

Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158, 2023.
[75]

Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1840–1873, 2024.
[76]

Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik R Narasimhan, and Vishvak Murahari. PersonaGym: Evaluating persona agents and LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMN...
[77]

Lawrence J. Klinkert, Steph Buongiorno, and Corey Clark. Evaluating the efficacy of llms to emulate realistic human personalities. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 20(1):65–75, Nov. 2024.
[78]

Keane Ong, Wei Dai, Carol Li, Dewei Feng, Hengzhi Li, Jingyao Wu, Jiaee Cheong, Rui Mao, Gianmarco Mengaldo, Erik Cambria, et al. Human behavior atlas: Benchmarking unified psychological and social behavior understanding. arXiv preprint arXiv:2510.04899, 2025.
[79]

Xinyi Mou, Jingcong Liang, Jiayu Lin, Xinnong Zhang, Xiawei Liu, Shiyue Yang, Rong Ye, Lei Chen, Haoyu Kuang, Xuanjing Huang, and Zhongyu Wei. AgentSense: Benchmarking social intelligence of language agents through interactive scenarios. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas C...
[80]

Noah Wang, Z.y. Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, and Junran Peng. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar...


[45] [45]

The silicon sample: Benchmarking synthetic users against human respondents in market research.Available at SSRN 5835122, 2025

Arya Agarwal. The silicon sample: Benchmarking synthetic users against human respondents in market research.Available at SSRN 5835122, 2025. 12

work page 2025

[46] [46]

Twinmarket: A scalable behavioral and social simulation for financial markets.arXiv preprint arXiv:2502.01506, 2025

Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets.arXiv preprint arXiv:2502.01506, 2025

work page arXiv 2025

[47] [47]

Econagent: large language model-empowered agents for simulating macroeconomic activities

Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: large language model-empowered agents for simulating macroeconomic activities. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15523–15536, 2024

work page 2024

[48] [48]

Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan Suchow, Zhenyu Cui, Rong Liu, et al. Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making.Advances in Neural Information Processing Systems, 37:137010–137045, 2024

work page 2024

[49] [49]

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, and Leo Huang. Sell more, play less: Bench- marking llm realistic selling skill.arXiv preprint arXiv:2604.07054, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[50] [50]

Debate: A large-scale benchmark for role-playing llm agents in multi-agent, long-form debates.arXiv preprint arXiv:2510.25110, 2025

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, et al. Debate: A large-scale benchmark for role-playing llm agents in multi-agent, long-form debates.arXiv preprint arXiv:2510.25110, 2025

work page arXiv 2025

[51] [51]

Benchmark- ing overton pluralism in llms.arXiv preprint arXiv:2512.01351, 2025

Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, and Michiel A Bakker. Benchmark- ing overton pluralism in llms.arXiv preprint arXiv:2512.01351, 2025

work page arXiv 2025

[52] [52]

Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao. Simulatorarena: Are user simulators reliable proxies for multi-turn evaluation of ai assistants? InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35200–35278, 2025

work page 2025

[53] [53]

arXiv preprint arXiv:2601.17087 , year=

Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, and Seraphina Goldfarb-Tarrant. Lost in simulation: Llm-simulated users are unreliable proxies for human users in agentic evaluations.arXiv preprint arXiv:2601.17087, 2026

work page arXiv 2026

[54] [54]

War and peace (waragent): Large language model-based multi-agent simulation of world wars.arXiv preprint arXiv:2311.17227, 2023

Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars.arXiv preprint arXiv:2311.17227, 2023

work page arXiv 2023

[55] [55]

S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents

Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents.arXiv preprint arXiv:2307.14984, 2023

[56] Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, et al. Gensim: A general social simulation platform with large language model based agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolo...

[57] Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, et al. Socioverse: A world model for social simulation powered by LLM agents and a pool of 10 million real-world users. arXiv preprint arXiv:2504.10157, 2025.

[58] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In ICLR, 2024.

[59] Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, and Bingsheng He. Megaagent: A large-scale autonomous LLM-based multi-agent system without predefined SOPs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036, 2025.

[60] Jing Liu, Xinxing Ren, Yanmeng Xu, and Zekun Guo. Can AI automatically analyze public opinion? A LLM agents-based agentic pipeline for timely public opinion analysis. arXiv preprint arXiv:2505.11401, 2025.

[61] Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024.

[62] Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, et al. From persona to personalization: A survey on role-playing language agents. Transactions on Machine Learning Research.

[63] James Bisbee, Joshua D Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M Larson. Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4):401–416, 2024.

[64] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, 2019.

[65] Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, et al. Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8397–8437, 2025.

[66] Min Zeng. Psychcounsel-bench: Evaluating the psychology intelligence of large language models. arXiv preprint arXiv:2510.01611, 2025.

[67] Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Gao Xing, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, and Fei Huang. Socialbench: Sociality evaluation of role-playing conversational agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2108–2126, 2024.

[68] Giovanni Franco Gabriel Marraffini, Andrés Cotton, Noe Fabian Hsueh, Axel Fridman, Juan Wisznia, and Luciano Del Corro. The greatest good benchmark: Measuring LLMs’ alignment with utilitarian moral dilemmas. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21950–21959, 2024.

[69] Anonymous. Social-r1: Enhancing social intelligence in LLMs through human-like reinforced reasoning. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.

[70] Xixian Yong, Jianxun Lian, Xiaoyuan Yi, Xiao Zhou, and Xing Xie. Motivebench: How far are we from human-like motivational reasoning in large language models? arXiv preprint arXiv:2506.13065, 2025.

[71] Sahand Sabour, Siyang Liu, Zheyuan Zhang, June Liu, Jinfeng Zhou, Alvionna Sunaryo, Tatia Lee, Rada Mihalcea, and Minlie Huang. Emobench: Evaluating the emotional intelligence of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5986–6004, 2024.

[72] Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, et al. Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922, 2025.

[73] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. In International Conference on Learning Representations, 2021.

[74] Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-LLM: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158, 2023.

[75] Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1840–1873, 2024.

[76] Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik R Narasimhan, and Vishvak Murahari. PersonaGym: Evaluating persona agents and LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMN...

[77] Lawrence J. Klinkert, Steph Buongiorno, and Corey Clark. Evaluating the efficacy of LLMs to emulate realistic human personalities. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 20(1):65–75, Nov. 2024.

[78] Keane Ong, Wei Dai, Carol Li, Dewei Feng, Hengzhi Li, Jingyao Wu, Jiaee Cheong, Rui Mao, Gianmarco Mengaldo, Erik Cambria, et al. Human behavior atlas: Benchmarking unified psychological and social behavior understanding. arXiv preprint arXiv:2510.04899, 2025.

[79] Xinyi Mou, Jingcong Liang, Jiayu Lin, Xinnong Zhang, Xiawei Liu, Shiyue Yang, Rong Ye, Lei Chen, Haoyu Kuang, Xuanjing Huang, and Zhongyu Wei. AgentSense: Benchmarking social intelligence of language agents through interactive scenarios. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas C...

[80] Noah Wang, Z.y. Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, and Junran Peng. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar...
