pith. sign in

arxiv: 2509.06337 · v2 · submitted 2025-09-08 · 💻 cs.AI

Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation

Pith reviewed 2026-05-18 18:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords large language modelssurvey simulationsociodemographic responsesvirtual respondentsbenchmarkprompt engineeringsynthetic data generation
0
0 comments X

The pith

Large language models can generate sociodemographic survey responses with consistent patterns across different models and prompt settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces ways to use large language models to act as virtual survey respondents by either filling in missing details from partial information or creating entire new response sets. It builds a collection of eleven real survey datasets covering four areas of social research to test several models in different prompting styles. The results point to steady behaviors in how well models perform these tasks, influenced by added context, but also show problems when models must output structured data. The work positions these methods as tools for exploration rather than substitutes for actual surveys.

Core claim

Partial Attribute Simulation (PAS) lets models predict missing attributes from incomplete respondent profiles, while Full Attribute Simulation (FAS) has models create complete synthetic datasets with or without additional context. Using the LLM-S^3 benchmark of eleven public datasets, evaluations of GPT-3.5/4 Turbo and LLaMA 3 models reveal consistent performance trends across model families and show that context and prompt design influence how closely the generated responses match real distributions.

What carries the argument

The PAS and FAS task abstractions framed as diagnostic tools for assessing LLM simulation of sociodemographic responses on a multi-domain benchmark.

If this is right

  • Adding context to prompts tends to increase how well the generated responses align with real survey data distributions.
  • Models from different families display similar trends in their simulation accuracy on both partial and full attribute tasks.
  • Generating structured outputs like formatted datasets remains unreliable and is a key limitation.
  • Prompt design choices have a direct impact on the fidelity of the simulated survey responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such simulations could allow researchers to test survey questions on synthetic populations before running costly real surveys.
  • Comparing LLM outputs to real data might help identify and mitigate demographic biases in language models.
  • Extending the approach to new domains could provide quick insights into public opinion on emerging issues.

Load-bearing premise

That how closely the model outputs match the distributions in these eleven chosen datasets indicates whether they would produce useful and unbiased results in actual social science work.

What would settle it

Collect responses from a fresh survey on a topic outside the benchmark datasets and compare the conclusions drawn from real data versus LLM-generated data on the same questions.

Figures

Figures reproduced from arXiv: 2509.06337 by Chenyu Yuan, Denghui Zhang, Guangwei Zhang, Haoling Xie, Jianpeng Zhao, Pengyang Wang, Steven Jige Quan, Weiming Luo, Zixuan Yuan.

Figure 1
Figure 1. Figure 1: An Overview of Partial Attribute Simulation (left) and Full Attribute Simulation (right). [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average PAS performance (↑) across two tasks, grouped by model series and evaluation strategy. Numerical Prediction Tasks [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Representative error cases in FAS. Experimental results in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Statistical analysis of failure rates across different attributes in FAS. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance Comparison between LLaMA models and their DeepSeek-R1-Distill counterparts. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Impact of contextual prompts on predictive accuracy in multiple-choice tasks. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of output distributions under different few-shot example configurations. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Radar plots illustrating model performance on the PAS scenario across datasets (each axis is individually scaled per dataset [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Radar plots illustrating model performance on the FAS scenario across datasets (each axis is individually scaled per dataset to [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: A representative prompt utilized in zero-shot and few-shot evaluation conducted on the ANES dataset within the PAS [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: A representative prompt utilized in zero-context and context-enhanced generation conducted on the RECS dataset within the [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
read the original abstract

Questionnaire-based surveys are foundational to social science research and public policymaking, yet traditional survey methods remain costly, time-consuming, and often limited in scale. Although prior work has explored large language models (LLMs) as virtual survey respondents, existing studies often address narrow task settings, focus on single sociological domains, or lack a unified evaluation framework that enables systematic comparison across diverse datasets and models. To address these gaps, we introduce two complementary task abstractions: Partial Attribute Simulation (PAS), where LLMs predict missing attributes from incomplete respondent profiles, and Full Attribute Simulation (FAS), where LLMs generate complete synthetic datasets under zero-context and context-enhanced conditions. Both are framed as diagnostic and exploratory tools rather than replacements for human data collection. We curate LLM-S^3 (Large Language Model-based Sociodemographic Survey Simulation), a benchmark spanning 11 real-world public datasets across four sociological domains, and evaluate GPT-3.5/4 Turbo and LLaMA 3.0/3.1-8B under zero-shot and few-shot settings. Our evaluation reveals consistent performance trends across model families, highlights failure modes in structured output generation, and demonstrates how context and prompt design affect simulation fidelity. Our code and dataset are available at: https://github.com/dart-lab-research/LLM-S-Cube-Benchmark

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS) tasks to evaluate LLMs as virtual survey respondents. It curates the LLM-S^3 benchmark with 11 real-world public datasets across four sociological domains and evaluates GPT-3.5/4 Turbo and LLaMA 3.0/3.1-8B models under zero-shot and few-shot prompting, reporting consistent performance trends across model families, effects of context and prompt design on fidelity, and failure modes in structured output generation. The work positions these as diagnostic tools rather than replacements for human surveys, with public code and data released.

Significance. If the central empirical findings hold, the unified PAS/FAS framework and LLM-S^3 benchmark would provide a reproducible basis for comparing LLMs on sociodemographic simulation tasks, with value for identifying prompting strategies and output failures. The open release of code and datasets is a clear strength that supports follow-on work. However, the significance for social-science applications hinges on whether marginal distribution matching serves as a sufficient proxy for preserving joint distributions and avoiding LLM-induced biases.

major comments (2)
  1. [§4 and §5] §4 (Evaluation Framework) and §5 (Results): The fidelity metrics for both PAS and FAS appear to emphasize per-attribute accuracy or marginal distribution matches against the 11 source datasets. No results are shown for preservation of joint distributions (e.g., correlation matrices between sociodemographic variables or conditional probabilities). This is load-bearing for the claim of 'simulation fidelity' because marginal agreement can coexist with distorted correlations, which would limit utility in actual survey simulation.
  2. [§5.2] §5.2 (FAS experiments): The paper reports performance trends and prompt-design effects but does not include statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) on the differences across models, zero-shot vs. few-shot, or zero-context vs. context-enhanced conditions. Without these, it is difficult to determine whether the 'consistent performance trends' are robust or could be explained by prompt sensitivity.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief explicit statement of the exact metrics (accuracy, KL divergence, etc.) used to quantify 'simulation fidelity' so readers can immediately assess the evaluation scope.
  2. [Results figures] Table captions and axis labels in the results figures should clarify whether reported values are averages over all 11 datasets or broken down by domain, to improve interpretability of the cross-model trends.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of evaluation that can strengthen the manuscript. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Evaluation Framework) and §5 (Results): The fidelity metrics for both PAS and FAS appear to emphasize per-attribute accuracy or marginal distribution matches against the 11 source datasets. No results are shown for preservation of joint distributions (e.g., correlation matrices between sociodemographic variables or conditional probabilities). This is load-bearing for the claim of 'simulation fidelity' because marginal agreement can coexist with distorted correlations, which would limit utility in actual survey simulation.

    Authors: We agree that marginal distribution matching alone is insufficient to fully substantiate claims of simulation fidelity, as distorted correlations or conditional relationships could undermine utility for downstream survey applications. The current evaluation framework in §§4–5 prioritizes per-attribute accuracy and marginal matches to establish baseline performance across the heterogeneous LLM-S^3 benchmark. We will add joint-distribution analyses (correlation matrices and selected conditional probabilities) to the revised §5, computed on the same 11 datasets, to address this limitation directly. revision: yes

  2. Referee: [§5.2] §5.2 (FAS experiments): The paper reports performance trends and prompt-design effects but does not include statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) on the differences across models, zero-shot vs. few-shot, or zero-context vs. context-enhanced conditions. Without these, it is difficult to determine whether the 'consistent performance trends' are robust or could be explained by prompt sensitivity.

    Authors: The referee correctly notes that formal statistical tests would better establish the robustness of the reported trends. While the patterns held consistently across 11 datasets and multiple model families, we did not include significance testing in the original submission. In the revision we will add paired t-tests and bootstrap confidence intervals to the comparisons in §5.2, quantifying differences between models, zero-shot/few-shot settings, and context conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation on external datasets

full rationale

The paper defines PAS and FAS as task abstractions, curates the LLM-S^3 benchmark from 11 independent real-world public datasets across sociological domains, and reports observed performance trends, failure modes, and effects of context/prompt design under zero-shot and few-shot settings for specific LLMs. No equations, fitted parameters renamed as predictions, self-citations, uniqueness theorems, or ansatzes appear in the provided text; all claims rest on direct comparison to external ground-truth distributions rather than any reduction to the paper's own inputs or prior author work. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that matching distributions in the curated real datasets is a valid test of simulation quality; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Real-world public survey datasets provide ground-truth distributions against which LLM outputs can be meaningfully compared.
    Invoked when defining PAS and FAS evaluation.

pith-pipeline@v0.9.0 · 5792 in / 1089 out tokens · 28618 ms · 2026-05-18T18:43:17.623693+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes

    cs.CY 2026-04 conditional novelty 5.0

    Demographic-only LLM agents for retirement survey prediction exhibit central tendency bias, fail to reproduce incorrect or 'don't know' answers, and miss factor interactions in regressions, unlike survey-anchored agents.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Jacy Reese Anthis, Ryan Liu, Sean M Richardson, Austin C Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, and Michael Bernstein. 2025. Llm social simulations are a promising research method. arXiv preprint arXiv:2504.02234 (2025)

  3. [3]

    Argyle, Ethan C

    Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua Gubler, Christopher Michael Rytting, and David Wingate. 2022. Out of One, Many: Using Language Models to Simulate Human Samples. CoRR abs/2209.06899 (2022). arXiv:2209.06899 doi:10.48550/ARXIV.2209.06899

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  5. [5]

    Yong Cao, Haijiang Liu, Arnav Arora, Isabelle Augenstein, Paul Röttger, and Daniel Hershcovich. 2025. Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies...

  6. [6]

    Yaqub Chaudhary and Jonnie Penn. 2024. Large Language Models as Instruments of Power: New Regimes of Autonomous Manipulation and Control. arXiv preprint arXiv:2405.03813 (2024)

  7. [7]

    Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, Kai Zheng, Defu Lian, and Enhong Chen. 2024. When large language models meet personalization: perspectives of challenges and opportunities. World Wide Web (WWW) 27, 4 (2024), 42. doi:10.1007/S11280-024-01276-1

  8. [8]

    Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. 2025. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235 (2025). Manuscript submitted to ACM Large Language Models as Virtual Survey Respondents...

  9. [9]

    Danica Dillion, Niket Tandon, Yuling Gu, and Kurt Gray. 2023. Can AI language models replace human participants? Trends in Cognitive Sciences 27, 7 (2023), 597–600

  10. [10]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024. A Survey on In-context Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Compu...

  11. [11]

    Naoki Egami, Musashi Hinck, Brandon M Stewart, and Hanying Wei. 2024. Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses. ”. (2024)

  12. [12]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  13. [13]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  14. [14]

    Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, and Michael R Lyu. 2023. Emotionally numb or empathetic? evaluating how llms feel using emotionbench. arXiv preprint arXiv:2308.03656 (2023)

  15. [15]

    Bernard J Jansen, Soon-gyo Jung, and Joni Salminen. 2023. Employing large language models in survey research. Natural Language Processing Journal 4 (2023), 100020

  16. [16]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  17. [17]

    Junsol Kim and Byungkyu Lee. 2023. AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction. arXiv preprint arXiv:2305.09620 (2023)

  18. [18]

    Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann, and Walter Daelemans. 2024. PersonalityChat: Conversation Distillation for Personalized Dialog Modeling with Facts and Traits. CoRR abs/2401.07363 (2024). arXiv:2401.07363 doi:10.48550/ARXIV.2401.07363

  19. [19]

    Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, and Zhongyu Wei

  20. [20]

    From individual to society: A survey on social simulation driven by large language model-based agents.arXiv preprint arXiv:2412.03563, 2024

    From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents. CoRR abs/2412.03563 (2024). arXiv:2412.03563 doi:10.48550/ARXIV.2412.03563

  21. [21]

    Man Tik Ng, Hui Tung Tse, Jen-tse Huang, Jingjing Li, Wenxuan Wang, and Michael R. Lyu. 2024. How Well Can LLMs Echo Us? Evaluating AI Chatbots’ Role-Play Ability with ECHO. CoRR abs/2404.13957 (2024). arXiv:2404.13957 doi:10.48550/ARXIV.2404.13957

  22. [22]

    Mateusz Ochal, Massimiliano Patacchiola, Jose Vazquez, Amos Storkey, and Sen Wang. 2023. Few-Shot Learning With Class Imbalance. IEEE Transactions on Artificial Intelligence 4, 5 (2023), 1348–1358. doi:10.1109/TAI.2023.3298303

  23. [23]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744

  24. [24]

    Keyu Pan and Yawen Zeng. 2023. Do llms possess a personality? making the mbti test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180 (2023)

  25. [25]

    LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

    Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie J. Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S. Bernstein. 2024. Generative Agent Simulations of 1,000 People. CoRR abs/2411.10109 (2024). arXiv:2411.10109 doi:10.48550/ARXIV.2411.10109

  26. [26]

    Letian Peng and Jingbo Shang. 2024. Quantifying and Optimizing Global Faithfulness in Persona-driven Role-playing.arXiv preprint arXiv:2405.07726 (2024)

  27. [27]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems 36 (2023), 53728–53741

  28. [28]

    Suhas Kamasetty Ramesh, Ayan Sengupta, and Tanmoy Chakraborty. 2025. On the Generalization vs Fidelity Paradox in Knowledge Distillation. arXiv preprint arXiv:2505.15442 (2025)

  29. [29]

    Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, and Guojie Song. 2024. ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models. arXiv preprint arXiv:2406.04214 (2024)

  30. [30]

    Hannes Rosenbusch, Claire E Stevenson, and Han LJ van der Maas. 2023. How accurate are GPT-3’s hypotheses about social science phenomena? Digital Society 2, 2 (2023), 26

  31. [31]

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. CoRR abs/2402.07927 (2024). arXiv:2402.07927 doi:10.48550/ARXIV.2402.07927

  32. [32]

    Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. LaMP: When Large Language Models Meet Personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Associa...

  33. [33]

    Ran- dom silicon sampling: Simulating human sub- population opinion using a large language model based on group-level demographic information

    Seungjong Sun, Eungu Lee, Dongyan Nan, Xiangying Zhao, Wonbyung Lee, Bernard J. Jansen, and Jang-Hyun Kim. 2024. Random Silicon Sampling: Simulating Human Sub-Population Opinion Using a Large Language Model Based on Group-Level Demographic Information. CoRR abs/2402.18144 (2024). arXiv:2402.18144 doi:10.48550/ARXIV.2402.18144 Manuscript submitted to ACM 1...

  34. [34]

    Petter Törnberg. 2023. Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588 (2023)

  35. [35]

    Petter Törnberg, Diliara Valeeva, Justus Uitermark, and Christopher Bail. 2023. Simulating social media using large language models to evaluate alternative news feed algorithms. arXiv preprint arXiv:2310.05984 (2023)

  36. [36]

    Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. 2024. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1840–1873

  37. [37]

    Qiujie Xie, Qiming Feng, Tianqi Zhang, Qingqiu Li, Linyi Yang, Yuejie Zhang, Rui Feng, Liang He, Shang Gao, and Yue Zhang. 2025. Human Simulacra: Benchmarking the Personification of Large Language Models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=BCP5nAHXqs

  38. [38]

    Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao. 2024. Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing? arXiv preprint arXiv:2404.12138 (2024)

  39. [39]

    Kaiqi Yang, Hang Li, Hongzhi Wen, Tai-Quan Peng, Jiliang Tang, and Hui Liu. 2024. Are Large Language Models (LLMs) Good Social Predictors?. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 2718–2730. doi:10.186...

  40. [40]

    Jiashu Yao, Heyan Huang, Zeming Liu, Haoyu Wen, Wei Su, Boao Qian, and Yuhang Guo. 2025. ReFF: Reinforcing Format Faithfulness in Language Models Across Varied Tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 25660–25668

  41. [41]

    Chenxiao Yu, Zhaotian Weng, Zheng Li, Xiyang Hu, and Yue Zhao. 2024. Will Trump Win in 2024? Predicting the US Presidential Election via Multi-step Reasoning with Large Language Models. arXiv preprint arXiv:2411.03321 (2024)

  42. [42]

    Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguisti...

  43. [43]

    Muzhi Zhou, Lu Yu, Xiaomin Geng, and Lan Luo. 2024. ChatGPT vs Social Surveys: Probing the Objective and Subjective Human Society. CoRR abs/2409.02601 (2024). arXiv:2409.02601 doi:10.48550/ARXIV.2409.02601

  44. [44]

    Input Attributes

    Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2024. Can large language models transform computational social science? Computational Linguistics 50, 1 (2024), 237–291. Manuscript submitted to ACM Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation 15 A Experimental Details ...