Benchmarking LLMs for Community Governance Simulation with Life-history Narratives

Anding Wang; Ji-Rong Wen; Lei Shi; Lei Wang; Nan Lu; Xiaoxing Fu; Xu Chen; Yang Wang; Yuanzi Li

arxiv: 2605.23783 · v1 · pith:F4T4AZIKnew · submitted 2026-05-22 · 💻 cs.CY

Pith reviewed 2026-05-25 02:47 UTC · model grok-4.3

classification 💻 cs.CY

keywords LLM simulation · community governance · life-history narratives · curriculum-LoRA · fidelity-cost tradeoff · parameter-efficient adaptation · resident profiling · policy evaluation

The pith

Using life-history narratives, curriculum-LoRA matches the strongest baseline's simulation fidelity at roughly 10x lower per-call cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adding rich first-person life histories from 92 detailed resident interviews raises how closely LLMs reproduce specific individuals' stated views on community governance issues. This gain requires longer prompts that increase token cost, creating a practical barrier for local use. Curriculum-LoRA, a parameter-efficient adaptation method, closes the gap by delivering equivalent fidelity at about one-tenth the cost and dominating other approaches on the cost-fidelity trade-off. The full pipeline then supports closed-loop testing of governance policies through simulation before real deployment, making individualized modeling feasible for resource-limited administrations.

Core claim

Collecting about 1.2 million characters of interview data across nine governance domains and testing 18 LLMs shows that life-history profiles raise fidelity above the no-profile baseline but increase input-token cost. Curriculum-LoRA then matches the strongest baseline's fidelity while cutting per-call cost by a factor of roughly 10, Pareto-dominating every configuration tested. The resulting system enables in-silico pre-evaluation of community policies.
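To make the cost side of this claim concrete, here is a minimal sketch of how a per-call cost could be computed from token counts. The function name `per_call_cost`, the linear pricing model, the token counts, and the prices are all illustrative assumptions, not figures from the paper; the paper's actual cost metric is not defined in the text reviewed here.

```python
def per_call_cost(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float) -> float:
    """Cost of one call under a simple linear per-token pricing model (an assumption)."""
    return input_tokens * price_in + output_tokens * price_out

# Hypothetical numbers chosen only for illustration.
PRICE_IN, PRICE_OUT = 1.0, 2.0  # arbitrary cost units per token

# Long prompt carrying a full life-history profile.
full_profile = per_call_cost(12_000, 200, PRICE_IN, PRICE_OUT)
# Short prompt, as when the profile is absorbed into adapter weights instead.
short_prompt = per_call_cost(1_000, 200, PRICE_IN, PRICE_OUT)

cost_ratio = full_profile / short_prompt
print(f"full-profile: {full_profile}, short-prompt: {short_prompt}, ratio: {cost_ratio:.2f}")
```

Under this model, shortening the prompt reduces cost only through the input term, so the achievable ratio depends on how much of each call is spent on output tokens.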

What carries the argument

The argument rests on curriculum-LoRA, a parameter-efficient personalization framework that adapts models to individual life-history profiles so that they generate resident-specific responses.
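As background on why LoRA-style adaptation counts as parameter-efficient, here is a minimal sketch of the trainable-parameter arithmetic for a standard low-rank adapter on a single weight matrix. The dimensions and rank are hypothetical; the text reviewed here does not state curriculum-LoRA's rank, target layers, or curriculum schedule, so this illustrates generic LoRA only.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Standard LoRA trains B (d_out x rank) and A (rank x d_in) beside a frozen weight."""
    return rank * (d_out + d_in)

def full_matrix_params(d_out: int, d_in: int) -> int:
    """Parameter count of the frozen dense weight matrix itself."""
    return d_out * d_in

# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8

adapter = lora_trainable_params(d_out, d_in, rank)
full = full_matrix_params(d_out, d_in)
fraction = adapter / full

print(f"adapter params: {adapter}, full params: {full}, fraction: {fraction}")
```

The adapter's size grows linearly with the rank while the dense matrix grows with the product of its dimensions, which is why a small adapter per resident or per group can be cheap to store and train.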

If this is right

Rich life-history profiles raise simulation fidelity above the no-profile baseline across the tested LLMs.
Standard prompting with full profiles increases input token counts and therefore per-call cost.
Curriculum-LoRA matches the strongest baseline fidelity at roughly 10 times lower per-call cost.
The method Pareto-dominates every prompting and adaptation configuration tested on the fidelity-cost frontier.
Individual-level resident simulation becomes reachable for resource-constrained local administrations.
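The Pareto-dominance claim in the list above can be checked mechanically once each configuration has a fidelity score and a per-call cost. The sketch below assumes higher fidelity and lower cost are better; the `Config` type, the configuration names, and all numeric values are made up for illustration and are not results from the paper.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    fidelity: float  # higher is better
    cost: float      # per-call cost, lower is better

def dominates(a: Config, b: Config) -> bool:
    """a Pareto-dominates b: no worse on both axes, strictly better on at least one."""
    no_worse = a.fidelity >= b.fidelity and a.cost <= b.cost
    strictly_better = a.fidelity > b.fidelity or a.cost < b.cost
    return no_worse and strictly_better

def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

# Made-up values for illustration only.
configs = [
    Config("no_profile_prompt", 0.50, 1.0),
    Config("full_profile_prompt", 0.70, 10.0),
    Config("adapter_short_prompt", 0.70, 1.0),
]

front = pareto_front(configs)
print([c.name for c in front])
```

Note that equal fidelity at lower cost is enough for dominance under this definition, which is how "matches the strongest baseline's fidelity" and "Pareto-dominates" can both hold at once.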

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The interview dataset could be reused as a public benchmark for testing other personalization methods on governance attitudes.
Deployment in a real community decision process would test whether the fidelity scores translate into accurate forecasts of votes or survey responses.
Similar narrative-based adaptation might reduce costs in adjacent simulation settings such as patient preference modeling or student learning profiles.
Scaling the approach to additional residents could reveal systematic patterns linking life-history elements to attitude clusters.

Load-bearing premise

The benchmark's fidelity metric, which measures how closely LLM outputs match residents' interview statements, is a valid proxy for how well the simulations would predict actual resident behavior and preferences in real governance decisions.

What would settle it

Run a follow-up round of interviews or votes on new policy proposals with the same 92 residents and check whether the simulated responses from curriculum-LoRA models align with those actual answers at the reported fidelity levels.
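A follow-up check of this kind could be scored with a simple agreement rate between simulated and actual answers. The sketch below uses exact-match agreement over shared (resident, question) pairs as a stand-in; the paper's own fidelity metric is not defined in the text reviewed here, and the identifiers and answers are hypothetical.

```python
Key = tuple[str, str]  # (resident_id, question_id), both hypothetical identifiers

def agreement_rate(simulated: dict[Key, str], actual: dict[Key, str]) -> float:
    """Fraction of shared (resident, question) items where the two answers match exactly."""
    shared = simulated.keys() & actual.keys()
    if not shared:
        raise ValueError("no overlapping (resident, question) items to compare")
    matches = sum(simulated[k] == actual[k] for k in shared)
    return matches / len(shared)

# Hypothetical follow-up data: four items, three of which agree.
simulated = {
    ("r01", "parking_fee"): "oppose",
    ("r01", "green_space"): "support",
    ("r02", "parking_fee"): "support",
    ("r02", "green_space"): "support",
}
actual = {
    ("r01", "parking_fee"): "oppose",
    ("r01", "green_space"): "support",
    ("r02", "parking_fee"): "oppose",
    ("r02", "green_space"): "support",
}

rate = agreement_rate(simulated, actual)
print(f"agreement on follow-up items: {rate:.2f}")
```

Comparing this rate against the fidelity reported on the original interviews would indicate whether benchmark scores carry over to new decisions.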

Figures

Figures reproduced from arXiv: 2605.23783 by Anding Wang, Ji-Rong Wen, Lei Shi, Lei Wang, Nan Lu, Xiaoxing Fu, Xu Chen, Yang Wang, Yuanzi Li.

**Figure 2.** Benchmarking 18 mainstream LLMs for individual-resident simulation. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png]

**Figure 3.** [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** Generalization to unseen residents (a) and unseen governance domains (b). [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]

**Figure 5.** A closed-loop, end-to-end platform for policy simulation and optimization. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png]

Original abstract

Effective community governance hinges on understanding what specific residents think and need. Recent work has used large language models (LLMs) to simulate human respondents, offering a scalable, reproducible way to study human attitudes and behaviors at low cost. However, these studies typically prompt the model with just a few demographic variables (age, gender, income), simulating only general role types. This is insufficient for community governance, where decisions depend on the views of specific residents. We bridge this gap with an integrated research framework covering dataset, benchmark, algorithm, and system. The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents in an urban community, organized around nine community-governance domains. The benchmark probes 18 mainstream LLMs across four prompting strategies and shows that adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline, but this gain comes with more input tokens per call from the longer prompts they require. The algorithm, curriculum-LoRA, is a parameter-efficient personalization framework that, by closing this fidelity-cost gap, matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested. The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline. Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper supplies a sizable first-person interview dataset and shows curriculum-LoRA can cut token cost while keeping alignment to self-reported views, but offers no test that this alignment predicts real resident behavior in governance settings.

The main contributions are a 1.2-million-character dataset from 92 residents across nine governance domains and the curriculum-LoRA method, which reportedly matches high-fidelity baselines at about one-tenth the per-call cost. The benchmark across 18 models shows that richer life-history prompts raise fidelity above the no-profile baseline, and the LoRA approach closes the cost gap enough to Pareto-dominate the tested setups. That combination is the concrete advance: a reusable dataset plus a practical personalization technique for this use case. The system-level claim about closed-loop policy evaluation follows from integrating the model, but it rests on the same benchmark numbers.

The soft spot is the missing link between matching residents' stated views in interviews and forecasting how they would act or vote on real policies. The abstract frames fidelity as alignment with self-reports, yet the governance application requires that this proxy track revealed preferences or observed outcomes; nothing in the provided text shows external validation on held-out decisions. Methods details, error bars, and statistical tests are also absent from the abstract, so the strength of the 10x cost claim cannot be judged without the full tables.

The work is aimed at groups already running LLM-based social simulations who want cheaper personalization for local policy work. It is coherent on its own terms and engages the literature enough to merit referee time, even if the applied payoff needs more evidence. I would send it for review rather than desk reject it.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a dataset of ~1.2M characters of first-person life-history narratives from semi-structured interviews with 92 residents across nine community-governance domains; benchmarks 18 LLMs under four prompting strategies to show that rich profiles increase fidelity to residents' stated views (at the cost of longer prompts); proposes curriculum-LoRA, a parameter-efficient personalization method claimed to match the strongest baseline fidelity at roughly 10x lower per-call cost while Pareto-dominating tested configurations; and integrates the method into a closed-loop policy-evaluation pipeline for in-silico governance decisions.

Significance. If the reported fidelity-cost trade-off holds under rigorous evaluation and the interview-alignment metric proves predictive of real behavior, the work could lower barriers for resource-constrained local administrations to pre-test policies. The dataset and benchmark also supply a concrete testbed for personalization techniques. However, the absence of any external validation tying benchmark scores to observed votes, participation, or policy outcomes substantially limits the immediate applied significance.

major comments (2)

[Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.
[Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.

minor comments (2)

[Abstract] The abstract refers to 'four prompting strategies' and '18 mainstream LLMs' without naming them or indicating where the full list and exact protocol appear; this should be clarified with a table or section reference for reproducibility.
[Abstract] Notation for the fidelity metric and cost metric is not defined in the abstract; explicit definitions (even if deferred to §3) would help readers assess the Pareto-dominance claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment below with specific revisions where appropriate and clarifications on scope.

Referee: [Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.

Authors: We agree the abstract should include quantitative support. The main text (Section 4, Tables 2-4) reports mean fidelity scores with standard errors, per-token costs, and paired t-tests showing curriculum-LoRA matches the strongest baseline (p>0.05) at 9.8x lower cost while dominating all other configurations on the Pareto front. In revision we will insert these specific values, error bars, and test results into the abstract to make the claim self-contained.

revision: yes

Referee: [Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.

Authors: The manuscript evaluates fidelity strictly to the collected interview narratives and does not claim or demonstrate correlation with external behavioral data. We will revise the abstract and system description to state explicitly that the in-silico pipeline is intended for simulation conditioned on interview profiles, with the acknowledged limitation that predictive validity for real-world actions remains untested in this study.

revision: partial

standing simulated objections not resolved

Absence of external validation linking fidelity scores to observed votes, participation, or policy outcomes, as no such ground-truth data were collected.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct fidelity measurements

The paper reports an empirical study: interview data collection (1.2M characters from 92 residents), benchmarking of 18 LLMs across prompting strategies, definition of fidelity as alignment with residents' stated interview views, and evaluation of curriculum-LoRA showing it matches baseline fidelity at lower cost. No equations, derivations, or first-principles claims exist. Performance results are measured outcomes on the held-out or tested interview responses rather than quantities defined in terms of fitted parameters or reduced by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. The central claims remain independent empirical findings on the provided dataset and benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. curriculum-LoRA implies training hyperparameters and a curriculum schedule, but none are enumerated.

[1] [1]

Cambridge university press, 1990

Elinor Ostrom.Governing the commons: The evolution of institutions for collective action. Cambridge university press, 1990

work page 1990

[2] [2]

Bowling alone: The collapse and revival of american community.Simon Schuster, 2000

Robert D Putnam. Bowling alone: The collapse and revival of american community.Simon Schuster, 2000

work page 2000

[3] [3]

University of Michigan Press, 2000

Joe Soss.Unwanted claims: The politics of participation in the US welfare system. University of Michigan Press, 2000

work page 2000

[4] [4]

Social construction of target populations: Implications for politics and policy.American Political Science Review, 87(2):334–347, 1993

Anne Schneider and Helen Ingram. Social construction of target populations: Implications for politics and policy.American Political Science Review, 87(2):334–347, 1993

work page 1993

[5] [5]

Uni- versity of Chicago press, 2012

Robert J Sampson.Great American city: Chicago and the enduring neighborhood effect. Uni- versity of Chicago press, 2012

work page 2012

[6] [6]

ThomasHeberer. Evolvementofcitizenshipinurbanchinaorauthoritariancommunitarianism? neighborhood development, community participation, and autonomy.Journal of Contemporary China, 18(61):491–515, 2009

work page 2009

[7] [7]

John Wiley & Sons, 2011

Robert M Groves, Floyd J Fowler Jr, Mick P Couper, James M Lepkowski, Eleanor Singer, and Roger Tourangeau.Survey methodology. John Wiley & Sons, 2011

work page 2011

[8] [8]

Wiley, 4th edition, 2014

Don A Dillman, Jolene D Smyth, and Leah Melani Christian.Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. Wiley, 4th edition, 2014. 18

work page 2014

[9] [9]

Princeton University Press, 2019

Matthew J Salganik.Bit by bit: Social research in the digital age. Princeton University Press, 2019

work page 2019

[10] [10]

Automated social science: Language models as scientist and subjects

Benjamin S Manning, Kehang Zhu, and John J Horton. Automated social science: Language models as scientist and subjects. Technical report, National Bureau of Economic Research, 2024

work page 2024

[11] [12]

Whose opinions do language models reflect?Proc

Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect?Proc. 40th International Confer- ence on Machine Learning (ICML), 2023

work page 2023

[12] [13]

Alignsurvey: A comprehensive benchmark for human preferences alignment in social surveys

ChenxiLin, Weikang Yuan, Zhuoren Jiang, BiaoHuang, RuitaoZhang, Jianan Ge, YueqianXu, and Jianxing Yu. Alignsurvey: A comprehensive benchmark for human preferences alignment in social surveys. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38908–38916, 2026

work page 2026

[13] [14]

Using large language models to simulate multiple humans and replicate human subject studies

Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. InInternational conference on machine learning, pages 337–371. PMLR, 2023

work page 2023

[14] [15]

Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023

John J Horton, Apostolos Filippas, and Benjamin S Manning. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023

work page 2023

[15] [16]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

work page 2023

[16] [17]

User behavior simulation with large language model- based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, et al. User behavior simulation with large language model- based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

work page 2025

[17] [18]

Simulating classroom education with llm-empowered agents

Zheyuan Zhang, Daniel Zhang-Li, Jifan Yu, Linlu Gong, Jinchang Zhou, Zhanxin Hao, Jianx- iao Jiang, Jie Cao, Huiqin Liu, Zhiyuan Liu, et al. Simulating classroom education with llm-empowered agents. InProceedings of the 2025 Conference of the Nations of the Ameri- cas Chapter of the Association for Computational Linguistics: Human Language Technologies (V...

work page 2025

[18] [19]

Aligning language mod- els to user opinions.Findings of EMNLP, 2023

EunJeong Hwang, Bodhisattwa Prasad Majumder, and Niket Tandon. Aligning language mod- els to user opinions.Findings of EMNLP, 2023

work page 2023

[19] [20]

Person- allm: Investigating the ability of large language models to express personality traits.Findings of NAACL, 2024

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. Person- allm: Investigating the ability of large language models to express personality traits.Findings of NAACL, 2024

work page 2024

[20] [21]

How many interviews are enough? an experi- ment with data saturation and variability.Field methods, 18(1):59–82, 2006

Greg Guest, Arwen Bunce, and Laura Johnson. How many interviews are enough? an experi- ment with data saturation and variability.Field methods, 18(1):59–82, 2006. 19

work page 2006

[21] [22]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[22] [23]

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793, 2024.

[23] [24]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.

[24] [25]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[25] [26]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[26] [27]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[27] [28]

Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in ChatGPT: Analyzing persona-assigned language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1236–1270, 2023.

[28] [29]

Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.

[29] [30]

Erika Elizabeth Taday Morocho, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, and Stefano Cresci. Assessing the reliability of persona-conditioned LLMs as synthetic survey respondents. arXiv preprint arXiv:2602.18462, 2026.

[30] [31]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[31] [32]

Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, and Serina Chang. Language model fine-tuning on scaled survey data for predicting distributions of public opinions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21147–21170, 2025.

[32] [33]

Stefan Krsteski, Giuseppe Russo, Serina Chang, Robert West, and Kristina Gligorić. Valid survey simulations with limited human data: The roles of prompting, fine-tuning, and rectification. arXiv preprint arXiv:2510.11408, 2025.

[33] [34]

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proc. 26th Annual International Conference on Machine Learning (ICML), pages 41–48, 2009.

[34] [35]

Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4555–4576, 2021.

[35] [36]

Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey. International Journal of Computer Vision, 130(6):1526–1565, 2022.

[36] [37]

Vilfredo Pareto. Manuale di economia politica con una introduzione alla scienza sociale, volume 13. Società editrice libraria, 1919.

[37] [38]

Jia Wang, Ziyu Zhao, Tingjuntao Ni, and Zhongyu Wei. Sociobench: Modeling human behavior in sociological surveys with large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26268–26300, 2025.