pith. sign in

arxiv: 2605.23783 · v1 · pith:F4T4AZIKnew · submitted 2026-05-22 · 💻 cs.CY

Benchmarking LLMs for Community Governance Simulation with Life-history Narratives

Pith reviewed 2026-05-25 02:47 UTC · model grok-4.3

classification 💻 cs.CY
keywords LLM simulationcommunity governancelife-history narrativescurriculum-LoRAfidelity-cost tradeoffparameter-efficient adaptationresident profilingpolicy evaluation
0
0 comments X

The pith

Curriculum-LoRA matches the strongest LLM simulation fidelity at roughly 10x lower per-call cost using life-history narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adding rich first-person life histories from 92 detailed resident interviews raises how closely LLMs reproduce specific individuals' stated views on community governance issues. This gain requires longer prompts that increase token cost, creating a practical barrier for local use. Curriculum-LoRA, a parameter-efficient adaptation method, closes the gap by delivering equivalent fidelity at about one-tenth the cost and dominating other approaches on the cost-fidelity trade-off. The full pipeline then supports closed-loop testing of governance policies through simulation before real deployment, making individualized modeling feasible for resource-limited administrations.

Core claim

Collecting 1.2 million characters of interview data across nine governance domains and testing 18 LLMs shows that life-history profiles improve fidelity over demographic baselines but raise input costs. Curriculum-LoRA then achieves the highest baseline fidelity while cutting per-call cost by a factor of roughly 10 and Pareto-dominating every configuration tested, with the resulting system enabling in-silico pre-evaluation of community policies.

What carries the argument

curriculum-LoRA, a parameter-efficient personalization framework that adapts models to individual life-history profiles to generate resident-specific responses.

If this is right

  • Rich life-history profiles raise simulation fidelity above the no-profile baseline across the tested LLMs.
  • Standard prompting with full profiles increases input token counts and therefore per-call cost.
  • Curriculum-LoRA matches the strongest baseline fidelity at roughly 10 times lower per-call cost.
  • The method Pareto-dominates every prompting and adaptation configuration tested on the fidelity-cost frontier.
  • Individual-level resident simulation becomes reachable for resource-constrained local administrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The interview dataset could be reused as a public benchmark for testing other personalization methods on governance attitudes.
  • Deployment in a real community decision process would test whether the fidelity scores translate into accurate forecasts of votes or survey responses.
  • Similar narrative-based adaptation might reduce costs in adjacent simulation settings such as patient preference modeling or student learning profiles.
  • Scaling the approach to additional residents could reveal systematic patterns linking life-history elements to attitude clusters.

Load-bearing premise

The benchmark's fidelity metric, which measures how closely LLM outputs match residents' interview statements, is a valid proxy for how well the simulations would predict actual resident behavior and preferences in real governance decisions.

What would settle it

Run a follow-up round of interviews or votes on new policy proposals with the same 92 residents and check whether the simulated responses from curriculum-LoRA models align with those actual answers at the reported fidelity levels.

Figures

Figures reproduced from arXiv: 2605.23783 by Anding Wang, Ji-Rong Wen, Lei Shi, Lei Wang, Nan Lu, Xiaoxing Fu, Xu Chen, Yang Wang, Yuanzi Li.

Figure 1
Figure 1. Figure 1: A life-history-grounded benchmark for individual-level resident simulation. a [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Benchmarking 18 mainstream LLMs for individual-resident simulation. a [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Generalization to unseen residents (a) and unseen governance domains (b). [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: A closed-loop, end-to-end platform for policy simulation and optimization. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
read the original abstract

Effective community governance hinges on understanding what specific residents think and need. Recent work has used large language models (LLMs) to simulate human respondents, offering a scalable, reproducible way to study human attitudes and behaviors at low cost. However, these studies typically prompt the model with just a few demographic variables (age, gender, income), simulating only general role types. This is insufficient for community governance, where decisions depend on the views of specific residents. We bridge this gap with an integrated research framework covering dataset, benchmark, algorithm, and system. The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents in an urban community, organized around nine community-governance domains. The benchmark probes 18 mainstream LLMs across four prompting strategies and shows that adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline, but this gain comes with more input tokens per call from the longer prompts they require. The algorithm, curriculum-LoRA, is a parameter-efficient personalization framework that, by closing this fidelity-cost gap, matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested. The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline. Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a dataset of ~1.2M characters of first-person life-history narratives from semi-structured interviews with 92 residents across nine community-governance domains; benchmarks 18 LLMs under four prompting strategies to show that rich profiles increase fidelity to residents' stated views (at the cost of longer prompts); proposes curriculum-LoRA, a parameter-efficient personalization method claimed to match the strongest baseline fidelity at roughly 10x lower per-call cost while Pareto-dominating tested configurations; and integrates the method into a closed-loop policy-evaluation pipeline for in-silico governance decisions.

Significance. If the reported fidelity-cost trade-off holds under rigorous evaluation and the interview-alignment metric proves predictive of real behavior, the work could lower barriers for resource-constrained local administrations to pre-test policies. The dataset and benchmark also supply a concrete testbed for personalization techniques. However, the absence of any external validation tying benchmark scores to observed votes, participation, or policy outcomes substantially limits the immediate applied significance.

major comments (2)
  1. [Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.
  2. [Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.
minor comments (2)
  1. [Abstract] The abstract refers to 'four prompting strategies' and '18 mainstream LLMs' without naming them or indicating where the full list and exact protocol appear; this should be clarified with a table or section reference for reproducibility.
  2. [Abstract] Notation for the fidelity metric and cost metric is not defined in the abstract; explicit definitions (even if deferred to §3) would help readers assess the Pareto-dominance claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment below with specific revisions where appropriate and clarifications on scope.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.

    Authors: We agree the abstract should include quantitative support. The main text (Section 4, Tables 2-4) reports mean fidelity scores with standard errors, per-token costs, and paired t-tests showing curriculum-LoRA matches the strongest baseline (p>0.05) at 9.8x lower cost while dominating all other configurations on the Pareto front. In revision we will insert these specific values, error bars, and test results into the abstract to make the claim self-contained. revision: yes

  2. Referee: [Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.

    Authors: The manuscript evaluates fidelity strictly to the collected interview narratives and does not claim or demonstrate correlation with external behavioral data. We will revise the abstract and system description to state explicitly that the in-silico pipeline is intended for simulation conditioned on interview profiles, with the acknowledged limitation that predictive validity for real-world actions remains untested in this study. revision: partial

standing simulated objections not resolved
  • Absence of external validation linking fidelity scores to observed votes, participation, or policy outcomes, as no such ground-truth data were collected.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct fidelity measurements

full rationale

The paper reports an empirical study: interview data collection (1.2M characters from 92 residents), benchmarking of 18 LLMs across prompting strategies, definition of fidelity as alignment with residents' stated interview views, and evaluation of curriculum-LoRA showing it matches baseline fidelity at lower cost. No equations, derivations, or first-principles claims exist. Performance results are measured outcomes on the held-out or tested interview responses rather than quantities defined in terms of fitted parameters or reduced by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. The central claims remain independent empirical findings on the provided dataset and benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. curriculum-LoRA implies training hyperparameters and a curriculum schedule, but none are enumerated.

pith-pipeline@v0.9.0 · 5816 in / 1183 out tokens · 24661 ms · 2026-05-25T02:47:53.101290+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 5 internal anchors

  1. [1]

    Cambridge university press, 1990

    Elinor Ostrom.Governing the commons: The evolution of institutions for collective action. Cambridge university press, 1990

  2. [2]

    Bowling alone: The collapse and revival of american community.Simon Schuster, 2000

    Robert D Putnam. Bowling alone: The collapse and revival of american community.Simon Schuster, 2000

  3. [3]

    University of Michigan Press, 2000

    Joe Soss.Unwanted claims: The politics of participation in the US welfare system. University of Michigan Press, 2000

  4. [4]

    Social construction of target populations: Implications for politics and policy.American Political Science Review, 87(2):334–347, 1993

    Anne Schneider and Helen Ingram. Social construction of target populations: Implications for politics and policy.American Political Science Review, 87(2):334–347, 1993

  5. [5]

    Uni- versity of Chicago press, 2012

    Robert J Sampson.Great American city: Chicago and the enduring neighborhood effect. Uni- versity of Chicago press, 2012

  6. [6]

    ThomasHeberer. Evolvementofcitizenshipinurbanchinaorauthoritariancommunitarianism? neighborhood development, community participation, and autonomy.Journal of Contemporary China, 18(61):491–515, 2009

  7. [7]

    John Wiley & Sons, 2011

    Robert M Groves, Floyd J Fowler Jr, Mick P Couper, James M Lepkowski, Eleanor Singer, and Roger Tourangeau.Survey methodology. John Wiley & Sons, 2011

  8. [8]

    Wiley, 4th edition, 2014

    Don A Dillman, Jolene D Smyth, and Leah Melani Christian.Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. Wiley, 4th edition, 2014. 18

  9. [9]

    Princeton University Press, 2019

    Matthew J Salganik.Bit by bit: Social research in the digital age. Princeton University Press, 2019

  10. [10]

    Automated social science: Language models as scientist and subjects

    Benjamin S Manning, Kehang Zhu, and John J Horton. Automated social science: Language models as scientist and subjects. Technical report, National Bureau of Economic Research, 2024

  11. [12]

    Whose opinions do language models reflect?Proc

    Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect?Proc. 40th International Confer- ence on Machine Learning (ICML), 2023

  12. [13]

    Alignsurvey: A comprehensive benchmark for human preferences alignment in social surveys

    ChenxiLin, Weikang Yuan, Zhuoren Jiang, BiaoHuang, RuitaoZhang, Jianan Ge, YueqianXu, and Jianxing Yu. Alignsurvey: A comprehensive benchmark for human preferences alignment in social surveys. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38908–38916, 2026

  13. [14]

    Using large language models to simulate multiple humans and replicate human subject studies

    Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. InInternational conference on machine learning, pages 337–371. PMLR, 2023

  14. [15]

    Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023

    John J Horton, Apostolos Filippas, and Benjamin S Manning. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023

  15. [16]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  16. [17]

    User behavior simulation with large language model- based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

    Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, et al. User behavior simulation with large language model- based agents.ACM Transactions on Information Systems, 43(2):1–37, 2025

  17. [18]

    Simulating classroom education with llm-empowered agents

    Zheyuan Zhang, Daniel Zhang-Li, Jifan Yu, Linlu Gong, Jinchang Zhou, Zhanxin Hao, Jianx- iao Jiang, Jie Cao, Huiqin Liu, Zhiyuan Liu, et al. Simulating classroom education with llm-empowered agents. InProceedings of the 2025 Conference of the Nations of the Ameri- cas Chapter of the Association for Computational Linguistics: Human Language Technologies (V...

  18. [19]

    Aligning language mod- els to user opinions.Findings of EMNLP, 2023

    EunJeong Hwang, Bodhisattwa Prasad Majumder, and Niket Tandon. Aligning language mod- els to user opinions.Findings of EMNLP, 2023

  19. [20]

    Person- allm: Investigating the ability of large language models to express personality traits.Findings of NAACL, 2024

    Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. Person- allm: Investigating the ability of large language models to express personality traits.Findings of NAACL, 2024

  20. [21]

    How many interviews are enough? an experi- ment with data saturation and variability.Field methods, 18(1):59–82, 2006

    Greg Guest, Arwen Bunce, and Laura Johnson. How many interviews are enough? an experi- ment with data saturation and variability.Field methods, 18(1):59–82, 2006. 19

  21. [22]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  22. [23]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  23. [24]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  24. [25]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  25. [26]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Sori- cut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  26. [27]

    Languagemodels are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, ArvindNeelakantan, PranavShyam, GirishSastry, AmandaAskell, etal. Languagemodels are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  27. [28]

    Toxicity in chatgpt: Analyzing persona-assigned language models

    Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. InFindings of the association for computational linguistics: EMNLP 2023, pages 1236–1270, 2023

  28. [29]

    Out of one, many: Using language models to simulate human samples.Political Analysis, 31(3):337–351, 2023

    Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples.Political Analysis, 31(3):337–351, 2023

  29. [30]

    Assessing the reliability of persona-conditioned llms as synthetic survey respondents

    Erika Elizabeth Taday Morocho, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, and Stefano Cresci. Assessing the reliability of persona-conditioned llms as synthetic survey respondents. arXiv preprint arXiv:2602.18462, 2026

  30. [31]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  31. [32]

    Language model fine-tuning on scaled survey data for predicting distributions of public opinions

    Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, and Serina Chang. Language model fine-tuning on scaled survey data for predicting distributions of public opinions. InPro- ceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21147–21170, 2025

  32. [33]

    Valid sur- vey simulations with limited human data: The roles of prompting, fine-tuning, and rectification

    Stefan Krsteski, Giuseppe Russo, Serina Chang, Robert West, and Kristina Gligorić. Valid sur- vey simulations with limited human data: The roles of prompting, fine-tuning, and rectification. arXiv preprint arXiv:2510.11408, 2025

  33. [34]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProc. 26th Annual International Conference on Machine Learning (ICML), pages 41–48, 2009. 20

  34. [35]

    A survey oncurriculum learning.IEEE transactions on pattern analysis and machine intelligence, 44(9):4555–4576, 2021

    XinWang, Yudong Chen, and Wenwu Zhu. A survey oncurriculum learning.IEEE transactions on pattern analysis and machine intelligence, 44(9):4555–4576, 2021

  35. [36]

    Curriculum learning: A survey.International Journal of Computer Vision, 130(6):1526–1565, 2022

    Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey.International Journal of Computer Vision, 130(6):1526–1565, 2022

  36. [37]

    Società editrice libraria, 1919

    Vilfredo Pareto.Manuale di economia politica con una introduzione alla scienza sociale, vol- ume 13. Società editrice libraria, 1919

  37. [38]

    Sociobench: Modeling human behavior in sociological surveys with large language models

    Jia Wang, Ziyu Zhao, Tingjuntao Ni, and Zhongyu Wei. Sociobench: Modeling human behavior in sociological surveys with large language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26268–26300, 2025. 21