arxiv: 2507.14922 · v2 · submitted 2025-07-20 · 💻 cs.CL

Synthia: Scalable Grounded Persona Generation from Social Media Data

Pith reviewed 2026-05-19 03:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords persona generation · social media data · Bluesky · LLM personas · virtual populations · computational social science · homophily · bias analysis

The pith

Synthia creates personas grounded in real Bluesky posts that better match human opinion distributions using smaller language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Synthia as a framework for building virtual populations by grounding LLM personas in actual posts from the Bluesky social network. It demonstrates improved alignment with human responses on social survey benchmarks compared to prior methods, while using smaller models. The method also maintains the real interaction structures among users, supporting network-based analyses like homophily studies. This addresses the challenge of creating authentic and scalable personas for computational social science simulations.

Core claim

Synthia grounds LLM-generated personas in real social-media posts from Bluesky while delegating narrative construction to language models. This produces virtual populations that align better with human opinion distributions across benchmarks, require substantially smaller models, show improved fairness across demographics, and maintain the original interaction graph structure among users for network-aware simulations.

What carries the argument

The Synthia framework, which grounds personas directly in publicly available Bluesky posts and delegates narrative construction to language models.

If this is right

  • Virtual populations align more closely with real human survey responses on social topics.
  • Simulations can use smaller language models without losing performance.
  • Fairness improves across most demographic groups in multi-dimensional bias checks.
  • Network structures from real users are preserved, supporting studies of homophily and social interactions.
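The first consequence above turns on how closely a persona population's answer distribution tracks the human one. As a minimal sketch, assuming a single closed-ended question with ordered answer options, the comparison could be computed as below. The metric choice (total variation and a 1-Wasserstein distance over ordinal options), the function names, and the toy data are illustrative assumptions, not the paper's documented evaluation code.

```python
from collections import Counter


def answer_distribution(answers, options):
    """Return the share of each option in `answers`, ordered as in `options`."""
    counts = Counter(answers)
    total = len(answers)
    return [counts[o] / total for o in options]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_ordinal(p, q):
    """1-Wasserstein distance for ordered options with unit spacing.

    Computed as the sum of absolute differences between the two CDFs.
    """
    dist, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        dist += abs(cdf_p - cdf_q)
    return dist


# Toy data (hypothetical): four persona answers vs. a human survey marginal.
options = ["agree", "neutral", "disagree"]
persona_answers = ["agree", "agree", "agree", "disagree"]
human = [0.5, 0.25, 0.25]

persona = answer_distribution(persona_answers, options)
tv = total_variation(persona, human)
w1 = wasserstein_ordinal(persona, human)
```

Lower values of either distance indicate closer alignment; the Wasserstein variant additionally penalizes mass that lands further away on the ordinal scale.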

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar grounding techniques could be tested on data from other platforms to check if results hold beyond Bluesky.
  • The preserved interaction graphs open possibilities for simulating information diffusion in larger networks.
  • Smaller models might lower the computational cost of running large-scale persona simulations.
  • Future work could explore how well these personas predict behavior in new scenarios not covered by existing benchmarks.

Load-bearing premise

That personas derived from public Bluesky posts represent authentic and unbiased samples of broader human populations across different demographics and platforms.

What would settle it

A direct comparison of opinion distributions generated by Synthia personas against a large-scale human survey on a topic or demographic not included in the paper's benchmarks.

Figures

Figures reproduced from arXiv: 2507.14922 by Erfan Moosavi Monazzah, Mohammad Taher Pilehvar, Vahid Rahimzadeh, Yadollah Yaghoobzadeh.

Figure 1
Figure 1: Overview comparison of SYNTHIA and leading persona and backstory datasets.
… LLMs (Zhang et al., 2024; Salewski et al., 2023), with less emphasis placed on methods for generating high-quality persona populations. Approaches for creating personas represent a spectrum. As shown in …
Figure 2
Figure 2: Illustrative example from SYNTHIA showing a backstory and its grounding social media posts. Highlights demonstrate how different spans of the backstory relate to their respective source posts.
… Bovet, 2024), we generate personas that balance authenticity and scalability, with the potential to scale to millions of backstories (see …
Figure 3
Figure 3: Overview of the SYNTHIA pipeline. Our approach involves (1) collecting and filtering high-quality user data from open social networks, (2) splitting user histories into three temporal windows (10%, 50%, and 100% of activity) to generate backstories, (3) evaluating population alignment with real-world social surveys, and (4) analyzing temporal effects on backstory characteristics including perplexity, thema…
Figure 4
Figure 4: Age distribution comparison across different …
Figure 5
Figure 5: Distribution of named entity counts and perplexity values across different temporal windows in …
Figure 6
Figure 6: Illustrative example of backstory variation across temporal windows, highlighting changes in specificity …
Figure 7
Figure 7: Prompt template for backstory generator model.
A.3 Hyperparameters
  • For demographic surveying and ATP question answering, we maintained the default hyperparameters as specified in the original Anthology paper.
  • For backstory generation, we used a temperature setting of 0.1 and limited maximum token generation to 400 tokens.
B Dataset Generation
We compiled a huge dataset of Bluesky social media activi…
Figure 9
Figure 9: Prompt for demographic trait question: age
Figure 10
Figure 10: Prompt for demographic trait question: gen…
Figure 12
Figure 12: Prompt for demographic trait question: in…
Figure 13
Figure 13: Prompt for demographic trait question: race
Figure 15
Figure 15: Distribution comparison of age demograph…
Figure 16
Figure 16: Distribution comparison of racial and ethnic …
Figure 17
Figure 17: Distribution comparison of education level
Figure 18
Figure 18: Distribution comparison of household in…
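
The Figure 3 caption describes splitting each user's history into windows covering 10%, 50%, and 100% of activity. A minimal sketch of that split follows, assuming the windows are cumulative prefixes of the chronologically sorted posts and that each post carries a `created_at` timestamp. Both are assumptions made for illustration, as are the function name and the rounding rule.

```python
import math


def temporal_windows(posts, fractions=(0.1, 0.5, 1.0)):
    """Return cumulative prefixes of a user's posts, keyed by fraction.

    Assumption: a "window" is the earliest `fraction` of the user's posts.
    Each non-empty history yields at least one post per window.
    """
    ordered = sorted(posts, key=lambda p: p["created_at"])
    windows = {}
    for fraction in fractions:
        k = max(1, math.ceil(fraction * len(ordered)))
        windows[fraction] = ordered[:k]
    return windows


# Toy history (hypothetical): ten posts with ISO-formatted dates,
# deliberately supplied in reverse order to show the sort.
posts = [
    {"created_at": f"2024-01-{day:02d}", "text": f"post {day}"}
    for day in range(10, 0, -1)
]

windows = temporal_windows(posts)
sizes = {fraction: len(chunk) for fraction, chunk in windows.items()}
```

Each window could then be passed to the backstory generator separately, which is what would allow the temporal comparison of backstory characteristics mentioned in the caption.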
Original abstract

Persona-driven simulations are increasingly used in computational social science, yet their validity critically depends on the fidelity of the underlying personas. Constructing virtual populations that are both authentic and scalable remains a central challenge. We introduce Synthia, a persona-generation framework that grounds LLM-generated personas in real social-media posts while delegating narrative construction to language models, using publicly available data from the Bluesky platform. Across multiple social-survey benchmarks, Synthia improves alignment with human opinion distributions over prior state-of-the-art approaches while relying on substantially smaller models. A multi-dimensional fairness and bias analysis shows that Synthia outperforms previous methods for most demographics across different dimensions. Uniquely, Synthia preserves interaction-graph structure among personas grounded in real social network users, enabling network-aware analysis, which we demonstrate through two homophily-focused case studies. Together, these results position Synthia as a practical and reliable framework for constructing scalable, high-fidelity, and equitable virtual populations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Synthia, a persona-generation framework that grounds LLM-generated personas in publicly available Bluesky social-media posts, delegating narrative construction to language models. It reports improved alignment with human opinion distributions on multiple social-survey benchmarks relative to prior state-of-the-art methods while using substantially smaller models, a multi-dimensional fairness and bias analysis showing outperformance for most demographics, and preservation of interaction-graph structure among grounded personas, demonstrated via two homophily-focused case studies.

Significance. If the alignment and fairness results hold after addressing potential confounds, Synthia would provide a practical, scalable method for constructing high-fidelity virtual populations that also support network-aware analyses, strengthening the validity of persona-driven simulations in computational social science.

major comments (2)
  1. [Evaluation and Benchmarks] Evaluation section: the reported gains in alignment with human opinion distributions on general-population social-survey benchmarks are not accompanied by reweighting, stratification, or explicit matching on demographics (age, education, ideology) to the target survey marginals. Given the documented self-selection characteristics of Bluesky users, this leaves open the possibility that improvements reflect platform-specific distribution shift rather than the grounding technique itself.
  2. [Fairness and Bias Analysis] Fairness and bias analysis: while multi-dimensional fairness results are presented, the section does not include explicit tests of transportability (e.g., cross-platform or cross-population generalization) or sensitivity checks that would isolate the contribution of the grounding method from Bluesky-specific selection effects.
minor comments (2)
  1. [Abstract] The abstract states that Synthia relies on 'substantially smaller models' but does not report the specific model sizes, parameter counts, or exact baseline models used for comparison.
  2. [Case Studies] The homophily case studies would benefit from additional detail on the quantitative metrics employed to demonstrate preservation of interaction-graph structure (e.g., specific graph statistics or statistical tests).
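
For minor comment 2, one candidate graph statistic is the nominal assortativity coefficient, which compares the observed share of within-group edges to the share expected if edge endpoints were paired at random. The sketch below is a generic illustration on an undirected toy graph; the node labels, group labels, and function names are hypothetical, and nothing here is taken from the paper's case studies.

```python
from collections import Counter


def same_group_edge_share(edges, group):
    """Fraction of edges whose two endpoints belong to the same group."""
    same = sum(1 for u, v in edges if group[u] == group[v])
    return same / len(edges)


def expected_share_random(edges, group):
    """Within-group edge share expected under random endpoint pairing.

    Uses the share of edge ends attached to each group (degree-weighted).
    """
    ends = Counter()
    for u, v in edges:
        ends[group[u]] += 1
        ends[group[v]] += 1
    total = sum(ends.values())
    return sum((count / total) ** 2 for count in ends.values())


def assortativity(edges, group):
    """Nominal assortativity: 0 means random mixing, 1 means perfect homophily."""
    observed = same_group_edge_share(edges, group)
    expected = expected_share_random(edges, group)
    if expected == 1.0:
        raise ValueError("assortativity is undefined when all nodes share one group")
    return (observed - expected) / (1.0 - expected)


# Toy undirected graph (hypothetical): two groups joined by one bridging edge.
group = {"a": "left", "b": "left", "c": "right", "d": "right"}
edges = [("a", "b"), ("c", "d"), ("b", "c")]

observed = same_group_edge_share(edges, group)
expected = expected_share_random(edges, group)
r = assortativity(edges, group)
```

Reporting the coefficient for both the original interaction graph and the persona-level graph, with the same group labels, would be one way to make the claim of preserved structure checkable.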

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major concerns below and have made revisions to strengthen the evaluation and analysis sections.

Point-by-point responses
  1. Referee: [Evaluation and Benchmarks] Evaluation section: the reported gains in alignment with human opinion distributions on general-population social-survey benchmarks are not accompanied by reweighting, stratification, or explicit matching on demographics (age, education, ideology) to the target survey marginals. Given the documented self-selection characteristics of Bluesky users, this leaves open the possibility that improvements reflect platform-specific distribution shift rather than the grounding technique itself.

    Authors: We agree that potential platform-specific selection effects represent a valid concern that could confound the interpretation of our results. In the revised manuscript, we have incorporated post-stratification reweighting based on inferred demographic attributes from user profiles and posts where available. We now report alignment metrics both before and after reweighting to the target survey distributions. These additional analyses indicate that the improvements attributable to the grounding method remain significant even after accounting for demographic shifts. We have also added explicit discussion of Bluesky's user demographics as a limitation of the current study.
    Revision: yes

  2. Referee: [Fairness and Bias Analysis] Fairness and bias analysis: while multi-dimensional fairness results are presented, the section does not include explicit tests of transportability (e.g., cross-platform or cross-population generalization) or sensitivity checks that would isolate the contribution of the grounding method from Bluesky-specific selection effects.

    Authors: We appreciate the suggestion to include transportability tests. Full cross-platform generalization would necessitate comparable grounded datasets from other platforms, which falls outside the scope of the present work focused on Bluesky data. To address the concern within our available resources, we have added sensitivity analyses in the revised fairness section. These include varying the proportion of grounded content and comparing performance against non-grounded LLM baselines on the same Bluesky-derived population. The results help isolate the grounding contribution. We have also included a dedicated subsection discussing potential selection effects and outlining future directions for cross-population validation.
    Revision: partial
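
The first response mentions post-stratification reweighting to target survey distributions without describing an implementation. A generic sketch, assuming one categorical stratum per persona and known target shares, might look like the following. The stratum labels, target shares, and function names are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, target_shares):
    """Weight each unit by (target share) / (sample share) of its stratum."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [target_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_distribution(answers, weights, options):
    """Share of total weight falling on each answer option."""
    total = sum(weights)
    return [
        sum(w for a, w in zip(answers, weights) if a == option) / total
        for option in options
    ]


# Toy population (hypothetical): younger personas are over-represented
# relative to the target survey marginal.
sample_groups = ["18-29", "18-29", "18-29", "30+"]
target_shares = {"18-29": 0.5, "30+": 0.5}
answers = ["yes", "yes", "no", "no"]
options = ["yes", "no"]

weights = poststratification_weights(sample_groups, target_shares)
unweighted = weighted_distribution(answers, [1.0] * len(answers), options)
reweighted = weighted_distribution(answers, weights, options)
```

A stratum that appears in the target but has no personas in the sample cannot be recovered by reweighting alone, which is one reason the referee's selection-effect concern is not fully resolved by this step.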

Circularity Check

0 steps flagged

No circularity: empirical validation on external benchmarks

full rationale

The paper introduces Synthia as a grounding framework that delegates narrative construction to LLMs while using real Bluesky posts as input. All reported improvements in alignment with human opinion distributions and fairness metrics are obtained via direct comparison against prior state-of-the-art methods on independent social-survey benchmarks. No equations, fitted parameters, or uniqueness theorems are defined in terms of the target results; the interaction-graph preservation is a direct consequence of the input data structure rather than a derived prediction. The derivation chain is therefore self-contained against external data and does not reduce to self-definition or load-bearing self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the premise that real social-media posts supply sufficient grounding for authentic personas; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • Domain assumption: Bluesky social media posts provide representative grounding for authentic and generalizable personas.
    This premise underpins the fidelity and fairness claims.

pith-pipeline@v0.9.0 · 5710 in / 1141 out tokens · 50396 ms · 2026-05-19T03:35:14.969286+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Graph-Based Alternatives to LLMs for Human Simulation

    cs.CL 2025-11 conditional novelty 6.0

    GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three ...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337--351

  2. [2]

    Julia Barnett, Kimon Kieslich, and Nicholas Diakopoulos. 2024. Simulating policy impacts: Developing a generative scenario writing method to evaluate the perceived effects of regulation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 82--93

  3. [3]

    Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. 2024. https://arxiv.org/abs/2404.18231 From persona to personalization: A survey on role-playing language agents. Preprint, arXiv:2404.18231

  4. [4]

    Zihao He, Minh Duc Chu, Rebecca Dorn, Siyi Guo, and Kristina Lerman. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.945 Community-cross-instruct: Unsupervised instruction generation for aligning large language models to online communities. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17001--17019, Mi...

  5. [5]

    EunJeong Hwang, Bodhisattwa Majumder, and Niket Tandon. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.393 Aligning language models to user opinions. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5906--5919, Singapore. Association for Computational Linguistics

  6. [6]

    Andy Liu, Mona Diab, and Daniel Fried. 2024. https://doi.org/10.18653/v1/2024.findings-acl.586 Evaluating large language model biases in persona-steered generation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9832--9850, Bangkok, Thailand. Association for Computational Linguistics

  7. [7]

    Lisa Messeri and MJ Crockett. 2024. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49--58

  8. [8]

    Suhong Moon, Marwa Abdulhai, Minwoo Kang, Joseph Suh, Widyadewi Soedarmadji, Eran Kohen Behar, and David Chan. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1110 Virtual personas for language models via an anthology of backstories. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19864--19897, Miami, Fl...

  9. [9]

    Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1--22

  10. [10]

    Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. 2024. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109

  11. [11]

    Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, and Rada Mihalcea. 2024. Cooperate or collapse: Emergence of sustainable cooperation in a society of llm agents. Advances in Neural Information Processing Systems, 37:111715--111759

  12. [12]

    Dorian Quelle and Alexandre Bovet. 2024. https://api.semanticscholar.org/CorpusID:270068060 Bluesky: Network topology, polarization, and algorithmic curation. PLoS ONE, 20(2):e0318034

  13. [13]

    Vahid Rahimzadeh, Ali Hamzehpour, Azadeh Shakery, and Masoud Asadpour. 2025. From millions of tweets to actionable insights: Leveraging llms for user profiling. arXiv preprint arXiv:2505.06184

  14. [14]

    Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. 2023. In-context impersonation reveals large language models' strengths and biases. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc

  15. [15]

    Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org

  16. [16]

    Christine V Stephens and Mary Breheny. 2013. https://api.semanticscholar.org/CorpusID:145289700 Narrative analysis in psychological research: An integrated approach to interpreting stories. Qualitative Research in Psychology, 10:14--27

  17. [17]

    Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, and 1 others. 2024. A simulation system towards solving societal-scale manipulation. arXiv preprint arXiv:2410.13915

  18. [18]

    Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.969 Two tales of persona in LLMs: A survey of role-playing and personalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16612--16631, Miami, Florida, USA. Ass...

  19. [19]

    Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. 2025a. https://doi.org/10.1145/3708985 User behavior simulation with large language model-based agents. ACM Trans. Inf. Syst., 43(2)

  20. [20]

    Pengda Wang, Huiqi Zou, Hanjie Chen, Tianjun Sun, Ziang Xiao, and Frederick L Oswald. 2025b. Personality structured interview for large language model simulation in personality research. arXiv preprint arXiv:2502.12109

  21. [21]

    Dulce Wilkinson Westberg, Moin Syed, Aerika Brittian Loyd, and William Dunlop. 2024. https://api.semanticscholar.org/CorpusID:273797244 Using intersectionality to understand how structural domains are embedded in life narratives. Journal of personality

  22. [22]

    Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao. 2024. https://arxiv.org/abs/2404.12138 Character is destiny: Can role-playing language agents make persona-driven decisions? Preprint, arXiv:2404.12138

  23. [23]

    Erhan Zhang, Xingzhu Wang, Peiyuan Gong, Yankai Lin, and Jiaxin Mao. 2024. https://doi.org/10.1145/3626772.3657963 Usimagent: Large language models for simulating search users. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, pages 2687--2692, New York, NY, USA. Association for C...
