Benchmarking LLMs for Community Governance Simulation with Life-history Narratives
Pith reviewed 2026-05-25 02:47 UTC · model grok-4.3
The pith
Curriculum-LoRA matches the strongest LLM simulation fidelity at roughly 10x lower per-call cost using life-history narratives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Collecting 1.2 million characters of interview data across nine governance domains and testing 18 LLMs shows that life-history profiles improve fidelity over demographic baselines but raise input costs. Curriculum-LoRA then achieves the highest baseline fidelity while cutting per-call cost by a factor of roughly 10 and Pareto-dominating every configuration tested, with the resulting system enabling in-silico pre-evaluation of community policies.
What carries the argument
curriculum-LoRA, a parameter-efficient personalization framework that adapts models to individual life-history profiles to generate resident-specific responses.
If this is right
- Rich life-history profiles raise simulation fidelity above the no-profile baseline across the tested LLMs.
- Standard prompting with full profiles increases input token counts and therefore per-call cost.
- Curriculum-LoRA matches the strongest baseline fidelity at roughly 10 times lower per-call cost.
- The method Pareto-dominates every prompting and adaptation configuration tested on the fidelity-cost frontier.
- Individual-level resident simulation becomes reachable for resource-constrained local administrations.
Where Pith is reading between the lines
- The interview dataset could be reused as a public benchmark for testing other personalization methods on governance attitudes.
- Deployment in a real community decision process would test whether the fidelity scores translate into accurate forecasts of votes or survey responses.
- Similar narrative-based adaptation might reduce costs in adjacent simulation settings such as patient preference modeling or student learning profiles.
- Scaling the approach to additional residents could reveal systematic patterns linking life-history elements to attitude clusters.
Load-bearing premise
The benchmark's fidelity metric, which measures how closely LLM outputs match residents' interview statements, is a valid proxy for how well the simulations would predict actual resident behavior and preferences in real governance decisions.
What would settle it
Run a follow-up round of interviews or votes on new policy proposals with the same 92 residents and check whether the simulated responses from curriculum-LoRA models align with those actual answers at the reported fidelity levels.
Figures
read the original abstract
Effective community governance hinges on understanding what specific residents think and need. Recent work has used large language models (LLMs) to simulate human respondents, offering a scalable, reproducible way to study human attitudes and behaviors at low cost. However, these studies typically prompt the model with just a few demographic variables (age, gender, income), simulating only general role types. This is insufficient for community governance, where decisions depend on the views of specific residents. We bridge this gap with an integrated research framework covering dataset, benchmark, algorithm, and system. The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents in an urban community, organized around nine community-governance domains. The benchmark probes 18 mainstream LLMs across four prompting strategies and shows that adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline, but this gain comes with more input tokens per call from the longer prompts they require. The algorithm, curriculum-LoRA, is a parameter-efficient personalization framework that, by closing this fidelity-cost gap, matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested. The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline. Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a dataset of ~1.2M characters of first-person life-history narratives from semi-structured interviews with 92 residents across nine community-governance domains; benchmarks 18 LLMs under four prompting strategies to show that rich profiles increase fidelity to residents' stated views (at the cost of longer prompts); proposes curriculum-LoRA, a parameter-efficient personalization method claimed to match the strongest baseline fidelity at roughly 10x lower per-call cost while Pareto-dominating tested configurations; and integrates the method into a closed-loop policy-evaluation pipeline for in-silico governance decisions.
Significance. If the reported fidelity-cost trade-off holds under rigorous evaluation and the interview-alignment metric proves predictive of real behavior, the work could lower barriers for resource-constrained local administrations to pre-test policies. The dataset and benchmark also supply a concrete testbed for personalization techniques. However, the absence of any external validation tying benchmark scores to observed votes, participation, or policy outcomes substantially limits the immediate applied significance.
major comments (2)
- [Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.
- [Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.
minor comments (2)
- [Abstract] The abstract refers to 'four prompting strategies' and '18 mainstream LLMs' without naming them or indicating where the full list and exact protocol appear; this should be clarified with a table or section reference for reproducibility.
- [Abstract] Notation for the fidelity metric and cost metric is not defined in the abstract; explicit definitions (even if deferred to §3) would help readers assess the Pareto-dominance claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below with specific revisions where appropriate and clarifications on scope.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.
Authors: We agree the abstract should include quantitative support. The main text (Section 4, Tables 2-4) reports mean fidelity scores with standard errors, per-token costs, and paired t-tests showing curriculum-LoRA matches the strongest baseline (p>0.05) at 9.8x lower cost while dominating all other configurations on the Pareto front. In revision we will insert these specific values, error bars, and test results into the abstract to make the claim self-contained. revision: yes
-
Referee: [Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.
Authors: The manuscript evaluates fidelity strictly to the collected interview narratives and does not claim or demonstrate correlation with external behavioral data. We will revise the abstract and system description to state explicitly that the in-silico pipeline is intended for simulation conditioned on interview profiles, with the acknowledged limitation that predictive validity for real-world actions remains untested in this study. revision: partial
- Absence of external validation linking fidelity scores to observed votes, participation, or policy outcomes, as no such ground-truth data were collected.
Circularity Check
No circularity: empirical benchmark with direct fidelity measurements
full rationale
The paper reports an empirical study: interview data collection (1.2M characters from 92 residents), benchmarking of 18 LLMs across prompting strategies, definition of fidelity as alignment with residents' stated interview views, and evaluation of curriculum-LoRA showing it matches baseline fidelity at lower cost. No equations, derivations, or first-principles claims exist. Performance results are measured outcomes on the held-out or tested interview responses rather than quantities defined in terms of fitted parameters or reduced by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. The central claims remain independent empirical findings on the provided dataset and benchmark.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Cambridge university press, 1990
Elinor Ostrom.Governing the commons: The evolution of institutions for collective action. Cambridge university press, 1990
work page 1990
-
[2]
Bowling alone: The collapse and revival of american community.Simon Schuster, 2000
Robert D Putnam. Bowling alone: The collapse and revival of american community.Simon Schuster, 2000
work page 2000
-
[3]
University of Michigan Press, 2000
Joe Soss.Unwanted claims: The politics of participation in the US welfare system. University of Michigan Press, 2000
work page 2000
-
[4]
Anne Schneider and Helen Ingram. Social construction of target populations: Implications for politics and policy.American Political Science Review, 87(2):334–347, 1993
work page 1993
-
[5]
Uni- versity of Chicago press, 2012
Robert J Sampson.Great American city: Chicago and the enduring neighborhood effect. Uni- versity of Chicago press, 2012
work page 2012
-
[6]
ThomasHeberer. Evolvementofcitizenshipinurbanchinaorauthoritariancommunitarianism? neighborhood development, community participation, and autonomy.Journal of Contemporary China, 18(61):491–515, 2009
work page 2009
-
[7]
Robert M Groves, Floyd J Fowler Jr, Mick P Couper, James M Lepkowski, Eleanor Singer, and Roger Tourangeau.Survey methodology. John Wiley & Sons, 2011
work page 2011
-
[8]
Don A Dillman, Jolene D Smyth, and Leah Melani Christian.Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. Wiley, 4th edition, 2014. 18
work page 2014
-
[9]
Princeton University Press, 2019
Matthew J Salganik.Bit by bit: Social research in the digital age. Princeton University Press, 2019
work page 2019
-
[10]
Automated social science: Language models as scientist and subjects
Benjamin S Manning, Kehang Zhu, and John J Horton. Automated social science: Language models as scientist and subjects. Technical report, National Bureau of Economic Research, 2024
work page 2024
-
[12]
Whose opinions do language models reflect?Proc
Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect?Proc. 40th International Confer- ence on Machine Learning (ICML), 2023
work page 2023
-
[13]
Alignsurvey: A comprehensive benchmark for human preferences alignment in social surveys
ChenxiLin, Weikang Yuan, Zhuoren Jiang, BiaoHuang, RuitaoZhang, Jianan Ge, YueqianXu, and Jianxing Yu. Alignsurvey: A comprehensive benchmark for human preferences alignment in social surveys. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38908–38916, 2026
work page 2026
-
[14]
Using large language models to simulate multiple humans and replicate human subject studies
Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. InInternational conference on machine learning, pages 337–371. PMLR, 2023
work page 2023
-
[15]
John J Horton, Apostolos Filippas, and Benjamin S Manning. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023
work page 2023
-
[16]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023
work page 2023
-
[17]
Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, et al. User behavior simulation with large language model- based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025
work page 2025
-
[18]
Simulating classroom education with llm-empowered agents
Zheyuan Zhang, Daniel Zhang-Li, Jifan Yu, Linlu Gong, Jinchang Zhou, Zhanxin Hao, Jianx- iao Jiang, Jie Cao, Huiqin Liu, Zhiyuan Liu, et al. Simulating classroom education with llm-empowered agents. InProceedings of the 2025 Conference of the Nations of the Ameri- cas Chapter of the Association for Computational Linguistics: Human Language Technologies (V...
work page 2025
-
[19]
Aligning language mod- els to user opinions.Findings of EMNLP, 2023
EunJeong Hwang, Bodhisattwa Prasad Majumder, and Niket Tandon. Aligning language mod- els to user opinions.Findings of EMNLP, 2023
work page 2023
-
[20]
Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. Person- allm: Investigating the ability of large language models to express personality traits.Findings of NAACL, 2024
work page 2024
-
[21]
Greg Guest, Arwen Bunce, and Laura Johnson. How many interviews are enough? an experi- ment with data saturation and variability.Field methods, 18(1):59–82, 2006. 19
work page 2006
-
[22]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Sori- cut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, ArvindNeelakantan, PranavShyam, GirishSastry, AmandaAskell, etal. Languagemodels are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020
work page 1901
-
[28]
Toxicity in chatgpt: Analyzing persona-assigned language models
Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. InFindings of the association for computational linguistics: EMNLP 2023, pages 1236–1270, 2023
work page 2023
-
[29]
Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples.Political Analysis, 31(3):337–351, 2023
work page 2023
-
[30]
Assessing the reliability of persona-conditioned llms as synthetic survey respondents
Erika Elizabeth Taday Morocho, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, and Stefano Cresci. Assessing the reliability of persona-conditioned llms as synthetic survey respondents. arXiv preprint arXiv:2602.18462, 2026
-
[31]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022
work page 2022
-
[32]
Language model fine-tuning on scaled survey data for predicting distributions of public opinions
Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, and Serina Chang. Language model fine-tuning on scaled survey data for predicting distributions of public opinions. InPro- ceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21147–21170, 2025
work page 2025
-
[33]
Stefan Krsteski, Giuseppe Russo, Serina Chang, Robert West, and Kristina Gligorić. Valid sur- vey simulations with limited human data: The roles of prompting, fine-tuning, and rectification. arXiv preprint arXiv:2510.11408, 2025
-
[34]
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProc. 26th Annual International Conference on Machine Learning (ICML), pages 41–48, 2009. 20
work page 2009
-
[35]
XinWang, Yudong Chen, and Wenwu Zhu. A survey oncurriculum learning.IEEE transactions on pattern analysis and machine intelligence, 44(9):4555–4576, 2021
work page 2021
-
[36]
Curriculum learning: A survey.International Journal of Computer Vision, 130(6):1526–1565, 2022
Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey.International Journal of Computer Vision, 130(6):1526–1565, 2022
work page 2022
-
[37]
Società editrice libraria, 1919
Vilfredo Pareto.Manuale di economia politica con una introduzione alla scienza sociale, vol- ume 13. Società editrice libraria, 1919
work page 1919
-
[38]
Sociobench: Modeling human behavior in sociological surveys with large language models
Jia Wang, Ziyu Zhao, Tingjuntao Ni, and Zhongyu Wei. Sociobench: Modeling human behavior in sociological surveys with large language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26268–26300, 2025. 21
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.