Synthia: Scalable Grounded Persona Generation from Social Media Data

Erfan Moosavi Monazzah; Mohammad Taher Pilehvar; Vahid Rahimzadeh; Yadollah Yaghoobzadeh

arxiv: 2507.14922 · v2 · submitted 2025-07-20 · 💻 cs.CL

Pith reviewed 2026-05-19 03:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords persona generation · social media data · Bluesky · LLM personas · virtual populations · computational social science · homophily · bias analysis

The pith

Synthia creates personas grounded in real Bluesky posts that better match human opinion distributions using smaller language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Synthia as a framework for building virtual populations by grounding LLM personas in actual posts from the Bluesky social network. It demonstrates improved alignment with human responses on social survey benchmarks compared to prior methods, while using smaller models. The method also maintains the real interaction structures among users, supporting network-based analyses like homophily studies. This addresses the challenge of creating authentic and scalable personas for computational social science simulations.

Core claim

Synthia grounds LLM-generated personas in real social-media posts from Bluesky while delegating narrative construction to language models. This produces virtual populations that align better with human opinion distributions across benchmarks, require substantially smaller models, show improved fairness across demographics, and maintain the original interaction graph structure among users for network-aware simulations.

What carries the argument

The Synthia framework that grounds personas directly in publicly available Bluesky posts and delegates narrative construction to language models.

If this is right

Virtual populations align more closely with real human survey responses on social topics.
Simulations can use smaller language models without losing performance.
Fairness improves across most demographic groups in multi-dimensional bias checks.
Network structures from real users are preserved, supporting studies of homophily and social interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar grounding techniques could be tested on data from other platforms to check if results hold beyond Bluesky.
The preserved interaction graphs open possibilities for simulating information diffusion in larger networks.
Smaller models might lower the computational cost of running large-scale persona simulations.
Future work could explore how well these personas predict behavior in new scenarios not covered by existing benchmarks.

Load-bearing premise

That personas derived from public Bluesky posts represent authentic and unbiased samples of broader human populations across different demographics and platforms.

What would settle it

A direct comparison of opinion distributions generated by Synthia personas against a large-scale human survey on a topic or demographic not included in the paper's benchmarks.
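
A comparison of this kind needs a distance between two answer distributions for the same survey question. The sketch below is illustrative only: this page does not name the metric the paper uses, so total variation and an ordinal 1-Wasserstein distance (unit spacing between answer options) are assumptions, and the counts are made-up example data.

```python
from itertools import accumulate


def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def total_variation(p, q):
    """Half the L1 distance between two distributions over the same options."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_ordinal(p, q):
    """1-Wasserstein distance for ordered options with unit spacing.

    For distributions on an ordered scale this equals the sum of absolute
    differences between the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical counts for one four-option question (not from the paper).
human = normalize([10, 20, 40, 30])
personas = normalize([20, 20, 30, 30])

print(round(total_variation(human, personas), 3))
print(round(wasserstein_ordinal(human, personas), 3))
```

Lower values mean the persona population answers more like the human sample. The ordinal distance additionally penalizes mass that moves further along the answer scale.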

Figures

Figures reproduced from arXiv: 2507.14922 by Erfan Moosavi Monazzah, Mohammad Taher Pilehvar, Vahid Rahimzadeh, Yadollah Yaghoobzadeh.

**Figure 1.** Overview comparison of SYNTHIA and leading persona and backstory datasets.

**Figure 2.** Illustrative example from SYNTHIA showing a backstory and its grounding social media posts. Highlights demonstrate how different spans of the backstory relate to their respective source posts.

**Figure 3.** Overview of the SYNTHIA pipeline. Our approach involves (1) collecting and filtering high-quality user data from open social networks, (2) splitting user histories into three temporal windows (10%, 50%, and 100% of activity) to generate backstories, (3) evaluating population alignment with real-world social surveys, and (4) analyzing temporal effects on backstory characteristics including perplexity, thema…

**Figure 4.** Age distribution comparison across different…

**Figure 5.** Distribution of named entity counts and perplexity values across different temporal windows in…

**Figure 6.** Illustrative example of backstory variation across temporal windows, highlighting changes in specificity…

**Figure 7.** Prompt template for backstory generator model.

A.3 Hyperparameters
• For demographic surveying and ATP question answering, we maintained the default hyperparameters as specified in the original Anthology paper.
• For backstory generation, we used a temperature setting of 0.1 and limited maximum token generation to 400 tokens.

B Dataset Generation
We compiled a huge dataset of Bluesky social media activi…

**Figure 9.** Prompt for demographic trait question: age

**Figure 10.** Prompt for demographic trait question: gen…

**Figure 12.** Prompt for demographic trait question: in…

**Figure 13.** Prompt for demographic trait question: race

**Figure 15.** Distribution comparison of age demograph…

**Figure 16.** Distribution comparison of racial and ethnic…

**Figure 17.** Distribution comparison of education level…

**Figure 18.** Distribution comparison of household in…
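
The Figure 3 caption mentions splitting each user's history into three temporal windows (10%, 50%, and 100% of activity). A minimal sketch of one way to do that follows. It assumes the windows are cumulative prefixes of the chronologically sorted posts, which the caption suggests but does not confirm. The `created_at` and `text` field names and the function name are hypothetical.

```python
def temporal_windows(posts, percents=(10, 50, 100)):
    """Return cumulative prefixes of a user's posts, keyed by percentage.

    Assumes each post is a dict with an ISO-8601 `created_at` string in a
    single consistent format, so that string order equals time order.
    """
    ordered = sorted(posts, key=lambda post: post["created_at"])
    windows = {}
    for pct in percents:
        # Integer ceiling of len * pct / 100, so a non-empty history
        # always yields at least one post in the smallest window.
        cutoff = (len(ordered) * pct + 99) // 100
        windows[pct] = ordered[:cutoff]
    return windows


# Hypothetical history: 20 posts, deliberately supplied newest-first.
history = [
    {"created_at": f"2024-01-{day:02d}T00:00:00Z", "text": f"post {day}"}
    for day in range(20, 0, -1)
]

windows = temporal_windows(history)
print({pct: len(items) for pct, items in windows.items()})
```

Each window would then be passed to the backstory generator separately, giving three backstories per user.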

Original abstract

Persona-driven simulations are increasingly used in computational social science, yet their validity critically depends on the fidelity of the underlying personas. Constructing virtual populations that are both authentic and scalable remains a central challenge. We introduce Synthia, a persona-generation framework that grounds LLM-generated personas in real social-media posts while delegating narrative construction to language models, using publicly available data from the Bluesky platform. Across multiple social-survey benchmarks, Synthia improves alignment with human opinion distributions over prior state-of-the-art approaches while relying on substantially smaller models. A multi-dimensional fairness and bias analysis shows that Synthia outperforms previous methods for most demographics across different dimensions. Uniquely, Synthia preserves interaction-graph structure among personas grounded in real social network users, enabling network-aware analysis, which we demonstrate through two homophily-focused case studies. Together, these results position Synthia as a practical and reliable framework for constructing scalable, high-fidelity, and equitable virtual populations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Synthia grounds personas in Bluesky posts and keeps the interaction graph for homophily work, with reported gains on survey alignment using smaller models, but platform selection effects need direct checks.

The main point is that Synthia pulls personas straight from real Bluesky posts, hands the narrative to an LLM, and keeps the original user interaction graph so you can run network studies like homophily without starting from scratch. It also claims tighter matches to human opinion distributions on several benchmarks than earlier methods, all while using smaller models and adding a multi-dimensional fairness check across demographics. That combination is what sets it apart from most prior persona work that either relies on synthetic templates or loses the social structure.

The graph preservation is the clearest practical addition here, since it directly supports the kind of network-aware simulations people in computational social science actually want to run. The fairness analysis is also worth noting because it tries to quantify how the generated population behaves across groups rather than just reporting overall averages. Those pieces look like genuine forward steps for building scalable virtual populations from public data.

The soft spot is the one the stress test flags. Bluesky users skew younger, more educated, and tech-oriented, and the benchmarks are general-population surveys. If the paper does not reweight or stratify the personas to match the target marginals on age, education, or ideology, then some of the reported alignment gains could simply reflect that platform distribution rather than the grounding method itself. The fairness numbers do not automatically fix this unless they include explicit transportability tests or cross-platform comparisons. I would also want to see the exact baseline implementations, data splits, and statistical details to judge whether the improvements hold up under different choices.

This work is for people building agent-based models or large-scale opinion simulations who need both scale and some network fidelity. A reader working on virtual populations or homophily studies would get concrete value from the pipeline and the case studies even if they end up adjusting for platform bias in their own use. It deserves a serious referee because the core technical choices are clear, the graph retention is reproducible in principle, and the fairness angle addresses a real concern in the field. The paper is ready for review with targeted revisions on generalizability.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Synthia, a persona-generation framework that grounds LLM-generated personas in publicly available Bluesky social-media posts, delegating narrative construction to language models. It reports improved alignment with human opinion distributions on multiple social-survey benchmarks relative to prior state-of-the-art methods while using substantially smaller models, a multi-dimensional fairness and bias analysis showing outperformance for most demographics, and preservation of interaction-graph structure among grounded personas, demonstrated via two homophily-focused case studies.

Significance. If the alignment and fairness results hold after addressing potential confounds, Synthia would provide a practical, scalable method for constructing high-fidelity virtual populations that also support network-aware analyses, strengthening the validity of persona-driven simulations in computational social science.

major comments (2)

[Evaluation and Benchmarks] Evaluation section: the reported gains in alignment with human opinion distributions on general-population social-survey benchmarks are not accompanied by reweighting, stratification, or explicit matching on demographics (age, education, ideology) to the target survey marginals. Given the documented self-selection characteristics of Bluesky users, this leaves open the possibility that improvements reflect platform-specific distribution shift rather than the grounding technique itself.
[Fairness and Bias Analysis] Fairness and bias analysis: while multi-dimensional fairness results are presented, the section does not include explicit tests of transportability (e.g., cross-platform or cross-population generalization) or sensitivity checks that would isolate the contribution of the grounding method from Bluesky-specific selection effects.

minor comments (2)

[Abstract] The abstract states that Synthia relies on 'substantially smaller models' but does not report the specific model sizes, parameter counts, or exact baseline models used for comparison.
[Case Studies] The homophily case studies would benefit from additional detail on the quantitative metrics employed to demonstrate preservation of interaction-graph structure (e.g., specific graph statistics or statistical tests).
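
As an illustration of the kind of graph statistic the last comment asks for, the sketch below computes the share of same-group edges and Newman's attribute assortativity coefficient for an undirected graph. This page does not say which metrics the paper's case studies use, so this is one standard choice, not the authors' method. The node labels and edge list are hypothetical.

```python
from collections import Counter


def attribute_assortativity(edges, label):
    """Return (same-group edge share, Newman's assortativity coefficient r).

    `edges` is a list of undirected (u, v) pairs and `label` maps each node
    to a categorical attribute. r = (observed - expected) / (1 - expected),
    where `expected` is the same-group share under random mixing that keeps
    each group's share of edge ends fixed.
    """
    if not edges:
        raise ValueError("need at least one edge")
    same = 0
    edge_ends = Counter()
    for u, v in edges:
        same += label[u] == label[v]
        edge_ends[label[u]] += 1
        edge_ends[label[v]] += 1
    m = len(edges)
    observed = same / m
    expected = sum((count / (2 * m)) ** 2 for count in edge_ends.values())
    if expected == 1.0:
        # Only one group touches any edge, so r is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical four-persona graph with two attribute groups.
label = {0: "a", 1: "a", 2: "b", 3: "b"}
edges = [(0, 1), (2, 3), (1, 2)]

share, r = attribute_assortativity(edges, label)
print(round(share, 3), round(r, 3))
```

A value of r near 1 indicates strong homophily, near 0 indicates mixing no different from chance, and negative values indicate a preference for cross-group ties.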

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major concerns below and have made revisions to strengthen the evaluation and analysis sections.

Referee: [Evaluation and Benchmarks] Evaluation section: the reported gains in alignment with human opinion distributions on general-population social-survey benchmarks are not accompanied by reweighting, stratification, or explicit matching on demographics (age, education, ideology) to the target survey marginals. Given the documented self-selection characteristics of Bluesky users, this leaves open the possibility that improvements reflect platform-specific distribution shift rather than the grounding technique itself.

Authors: We agree that potential platform-specific selection effects represent a valid concern that could confound the interpretation of our results. In the revised manuscript, we have incorporated post-stratification reweighting based on inferred demographic attributes from user profiles and posts where available. We now report alignment metrics both before and after reweighting to the target survey distributions. These additional analyses indicate that the improvements attributable to the grounding method remain significant even after accounting for demographic shifts. We have also added explicit discussion of Bluesky's user demographics as a limitation of the current study.
Revision: yes

Referee: [Fairness and Bias Analysis] Fairness and bias analysis: while multi-dimensional fairness results are presented, the section does not include explicit tests of transportability (e.g., cross-platform or cross-population generalization) or sensitivity checks that would isolate the contribution of the grounding method from Bluesky-specific selection effects.

Authors: We appreciate the suggestion to include transportability tests. Full cross-platform generalization would necessitate comparable grounded datasets from other platforms, which falls outside the scope of the present work focused on Bluesky data. To address the concern within our available resources, we have added sensitivity analyses in the revised fairness section. These include varying the proportion of grounded content and comparing performance against non-grounded LLM baselines on the same Bluesky-derived population. The results help isolate the grounding contribution. We have also included a dedicated subsection discussing potential selection effects and outlining future directions for cross-population validation.
Revision: partial
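
The post-stratification reweighting mentioned in the first response can be sketched as follows. This is a generic textbook version, not the authors' implementation: each persona gets the weight (target share of its group) / (sample share of its group), and answer distributions are then computed with those weights. The group names, target shares, and answers are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, target_shares):
    """Weight each unit by target share / sample share of its group."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    unknown = set(counts) - set(target_shares)
    if unknown:
        raise ValueError(f"no target share for groups: {sorted(unknown)}")
    empty = [g for g, share in target_shares.items() if share > 0 and counts[g] == 0]
    if empty:
        raise ValueError(f"no sampled units in groups: {sorted(empty)}")
    return [target_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_distribution(answers, weights):
    """Weighted share of each answer option."""
    totals = Counter()
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    total = sum(totals.values())
    return {answer: value / total for answer, value in totals.items()}


# Hypothetical persona sample that over-represents the younger group.
groups = ["18-29"] * 8 + ["65+"] * 2
answers = ["agree"] * 6 + ["disagree"] * 2 + ["disagree"] * 2
target = {"18-29": 0.5, "65+": 0.5}  # assumed survey marginals

weights = poststratification_weights(groups, target)
unweighted = weighted_distribution(answers, [1.0] * len(answers))
reweighted = weighted_distribution(answers, weights)
print(round(unweighted["agree"], 3), round(reweighted["agree"], 3))
```

Reporting an alignment metric on both the unweighted and reweighted distributions separates what the grounding method contributes from what the platform's demographic mix contributes.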

Circularity Check

0 steps flagged

No circularity: empirical validation on external benchmarks

The paper introduces Synthia as a grounding framework that delegates narrative construction to LLMs while using real Bluesky posts as input. All reported improvements in alignment with human opinion distributions and in fairness metrics come from direct comparison against prior state-of-the-art methods on independent social-survey benchmarks. No equations, fitted parameters, or uniqueness theorems are defined in terms of the target results; the interaction-graph preservation is a direct consequence of the input data structure rather than a derived prediction. The argument is therefore validated against external data and does not rest on self-definition or load-bearing self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the premise that real social-media posts supply sufficient grounding for authentic personas; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Bluesky social media posts provide representative grounding for authentic and generalizable personas.
This premise underpins the fidelity and fairness claims.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "SYNTHIA comprises 30K personas, realized through backstories synthesized from the content of 10K real human users across three distinct time windows... preserves interaction-graph structure among personas grounded in real social network users"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We evaluate the internal consistency of information within each backstory... using a capable LLM as a Judge"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Graph-Based Alternatives to LLMs for Human Simulation
cs.CL 2025-11 conditional novelty 6.0

GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three ...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351.

[2] Julia Barnett, Kimon Kieslich, and Nicholas Diakopoulos. 2024. Simulating policy impacts: Developing a generative scenario writing method to evaluate the perceived effects of regulation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 82–93.

[3] Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. 2024. From persona to personalization: A survey on role-playing language agents. Preprint, arXiv:2404.18231. https://arxiv.org/abs/2404.18231

[4] Zihao He, Minh Duc Chu, Rebecca Dorn, Siyi Guo, and Kristina Lerman. 2024. Community-cross-instruct: Unsupervised instruction generation for aligning large language models to online communities. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17001–17019, Mi... https://doi.org/10.18653/v1/2024.emnlp-main.945

[5] EunJeong Hwang, Bodhisattwa Majumder, and Niket Tandon. 2023. Aligning language models to user opinions. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5906–5919, Singapore. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.393

[6] Andy Liu, Mona Diab, and Daniel Fried. 2024. Evaluating large language model biases in persona-steered generation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9832–9850, Bangkok, Thailand. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.586

[7] Lisa Messeri and MJ Crockett. 2024. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49–58.

[8] Suhong Moon, Marwa Abdulhai, Minwoo Kang, Joseph Suh, Widyadewi Soedarmadji, Eran Kohen Behar, and David Chan. 2024. Virtual personas for language models via an anthology of backstories. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19864–19897, Miami, Fl... https://doi.org/10.18653/v1/2024.emnlp-main.1110

[9] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.

[10] Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. 2024. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109.

[11] Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, and Rada Mihalcea. 2024. Cooperate or collapse: Emergence of sustainable cooperation in a society of LLM agents. Advances in Neural Information Processing Systems, 37:111715–111759.

[12] Dorian Quelle and Alexandre Bovet. 2024. Bluesky: Network topology, polarization, and algorithmic curation. PLoS ONE, 20(2):e0318034. https://api.semanticscholar.org/CorpusID:270068060

[13] Vahid Rahimzadeh, Ali Hamzehpour, Azadeh Shakery, and Masoud Asadpour. 2025. From millions of tweets to actionable insights: Leveraging LLMs for user profiling. arXiv preprint arXiv:2505.06184.

[14] Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. 2023. In-context impersonation reveals large language models' strengths and biases. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.

[15] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org.

[16] Christine V Stephens and Mary Breheny. 2013. Narrative analysis in psychological research: An integrated approach to interpreting stories. Qualitative Research in Psychology, 10:14–27. https://api.semanticscholar.org/CorpusID:145289700

[17] Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, and 1 others. 2024. A simulation system towards solving societal-scale manipulation. arXiv preprint arXiv:2410.13915.

[18] Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. 2024. Two tales of persona in LLMs: A survey of role-playing and personalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16612–16631, Miami, Florida, USA. Ass... https://doi.org/10.18653/v1/2024.findings-emnlp.969

[19] Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. 2025a. User behavior simulation with large language model-based agents. ACM Trans. Inf. Syst., 43(2). https://doi.org/10.1145/3708985

[20] Pengda Wang, Huiqi Zou, Hanjie Chen, Tianjun Sun, Ziang Xiao, and Frederick L Oswald. 2025b. Personality structured interview for large language model simulation in personality research. arXiv preprint arXiv:2502.12109.

[21] Dulce Wilkinson Westberg, Moin Syed, Aerika Brittian Loyd, and William Dunlop. 2024. Using intersectionality to understand how structural domains are embedded in life narratives. Journal of Personality. https://api.semanticscholar.org/CorpusID:273797244

[22] Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao. 2024. Character is destiny: Can role-playing language agents make persona-driven decisions? Preprint, arXiv:2404.12138. https://arxiv.org/abs/2404.12138

[23] Erhan Zhang, Xingzhu Wang, Peiyuan Gong, Yankai Lin, and Jiaxin Mao. 2024. Usimagent: Large language models for simulating search users. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, pages 2687–2692, New York, NY, USA. Association for C... https://doi.org/10.1145/3626772.3657963