arxiv: 2507.04005 · v4 · submitted 2025-07-05 · 💻 cs.HC · cs.CY

Exploring a Gamified Personality Assessment Method through Interaction with LLM Agents Embodying Different Personalities

Pith reviewed 2026-05-19 06:31 UTC · model grok-4.3

classification 💻 cs.HC cs.CY
keywords personality assessment · LLM agents · gamification · Big Five personality · multi-personality representation · human-computer interaction · interactive games · assessment bias

The pith

Gamified interactions with multiple LLM agents embodying different personalities yield effective and more accurate personality assessments based on the Big Five model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that turns personality assessment into a game where users interact with several LLM agents, each set to represent a different personality. These interactions generate varied textual data that the system analyzes to score users on the classic Big Five traits while also producing explanations. The authors report that this multi-agent setup works better than simpler single-context approaches and that combining data from several agents reduces some of the systematic errors that appear when LLMs judge personality on their own. A user study supports the claim that the method is practical and low-intrusion for automated assessment in psychology and human-computer interaction.

Core claim

The Multi-PR GPA framework uses Large Language Models to create virtual agents with distinct personalities that engage users in interactive games; the resulting multi-type textual data then supports Big Five personality assessments that are both effective and interpretable, with superior results when the multiplicity of personality representations is taken into account and with partial mitigation of LLM assessment biases through multi-context aggregation.

What carries the argument

Multi-PR GPA framework that deploys several LLM agents with varied personalities to run gamified interactions and aggregates the generated textual data for Big Five scoring and bias analysis.

If this is right

  • The approach supports low-intrusion automated personality assessment suitable for psychology and HCI applications.
  • Multi-personality representation produces superior assessment performance compared with single-context methods.
  • Multi-context aggregation partially corrects systematic biases present in LLM personality judgments.
  • The generated multi-type textual data supplies interpretable insights alongside the trait scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interaction logs could be reused to study how personality expression changes across different game contexts.
  • Developers might embed the method in apps to make personality feedback more engaging than standard questionnaires.
  • The bias findings suggest a general need to test LLM outputs for consistency when they are used as proxies for human judgment.

Load-bearing premise

Interactions with LLM agents can draw out genuine and multifaceted human personality signals without the agents' own training biases dominating the results.

What would settle it

A controlled comparison in which the game-based scores show no stronger correlation with participants' established Big Five questionnaire results or real-world behavioral markers than scores from a single-personality LLM agent.
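
The comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function names `pearson` and `validity_gap` are hypothetical, and the three score lists are made-up toy values for a single trait, not data from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)

def validity_gap(reference, candidate, baseline):
    """Return (r_candidate, r_baseline, difference) against a reference scale."""
    r_candidate = pearson(reference, candidate)
    r_baseline = pearson(reference, baseline)
    return r_candidate, r_baseline, r_candidate - r_baseline

# Hypothetical toy scores for one Big Five trait (illustration only).
questionnaire = [3.2, 4.1, 2.5, 3.8, 4.6, 2.9]   # established questionnaire
multi_agent = [3.0, 4.3, 2.8, 3.6, 4.4, 3.1]     # multi-personality game scores
single_agent = [3.5, 3.9, 3.4, 3.2, 4.0, 3.6]    # single-personality baseline

if __name__ == "__main__":
    r_multi, r_single, gap = validity_gap(questionnaire, multi_agent, single_agent)
    print(f"multi-agent r={r_multi:.2f}, single-agent r={r_single:.2f}, gap={gap:.2f}")
```

A gap near zero or below it would count against the multi-agent claim. A real study would also need a significance test suited to correlations that share the same reference variable, since a raw difference on a small sample says little.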

Figures

Figures reproduced from arXiv: 2507.04005 by Baiqiao Zhang, Chao Zhou, Juan Liu, Nianlong Li, Shuai Ma, Xiangxian Li, Xiaojuan Ma, Xinyu Gai, Xue Yang, Yong-jin Liu, Yulong Bian.

Figure 1: A framework of gamified personality assessment through interacting with multi-personality agents. The icons above …
Figure 2: The framework includes Gamified Interaction (orange block, section 3.1), LLM Agent with Controlled Personality …
Figure 3: The design of our prototype system for gamified personality assessment based on Multi-PR GPA framework, which …
Figure 5: The user interface of the prototype system used in …
Figure 6: Case of the cognitive architecture applied in the agent’s decision-making process during a game round. The orange …
Figure 7: Workflow of the Direct and Questionnaire-based Assessment process for evaluating the player’s personality traits …
Figure 8: Overview of the experimental procedure.
Figure 9: Overview of the Big Five Personality traits, showcas…

Original abstract

Low-intrusion, automated personality assessment is receiving increasing attention in the fields of psychology and human-computer interaction. This study explores an interactive approach for personality assessment, focusing on the multiplicity of personality representation. We propose a framework of Gamified Personality Assessment through Multi-Personality Representations (Multi-PR GPA). The framework leverages Large Language Models to empower virtual agents with different personalities. These agents elicit multifaceted human personality representations through engaging in interactive games. Drawing upon the multi-type textual data generated throughout the interaction, it achieves personality assessments with interpretable insights. Grounded in the classic Big Five personality theory, we developed a prototype system and conducted a user study to evaluate the efficacy of Multi-PR GPA. The results affirm the effectiveness of our approach in personality assessment and demonstrate its superior performance when considering the multiplicity of personality representation. Error structure analysis further revealed systematic assessment biases in LLMs, which multi-context aggregation partially mitigated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the Multi-PR GPA framework, which employs LLM agents embodying varied personalities to engage users in gamified interactions. These interactions generate multi-type textual data for assessing Big Five personality traits, yielding interpretable insights. A prototype system was evaluated via user study, with results claimed to affirm effectiveness, demonstrate superiority when accounting for personality multiplicity, and show that multi-context aggregation partially mitigates systematic LLM assessment biases.

Significance. If the empirical claims are substantiated through proper anchoring to established instruments, the work could meaningfully advance low-intrusion, interactive personality assessment methods in HCI and psychology. The focus on multiplicity of representation and the error-structure analysis of LLM biases represent potentially useful contributions. However, the absence of key validation details currently constrains the assessed significance and generalizability.

major comments (3)
  1. [User Study] User Study section: Sample size, participant demographics, recruitment method, and any statistical tests (e.g., significance levels or effect sizes) are not reported despite the abstract's claims of positive results and bias mitigation. These omissions prevent evaluation of the reliability and generalizability of the reported effectiveness.
  2. [Results] Results section: The central claim of superior performance with multi-personality representations and partial bias mitigation lacks quantitative grounding such as correlation coefficients with validated instruments (NEO-PI-R or IPIP-NEO), inter-rater reliability, or ablation comparisons against single-personality baselines. Without these metrics, it remains unclear whether observed signals reflect user personality variance or consistent LLM response patterns.
  3. [Methodology] Methodology section: The exact scoring algorithms used to derive Big Five trait scores from the collected multi-context textual data, including the implementation of multi-context aggregation, are not specified. This hinders reproducibility and assessment of how the framework operationalizes personality assessment.
minor comments (2)
  1. [Abstract] The abstract would benefit from a concise statement of the user-study scale or primary quantitative outcomes to better contextualize the reported findings.
  2. [Framework Description] Notation for how textual features map to specific Big Five facets could be clarified for readers unfamiliar with the LLM prompting setup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The comments highlight important areas for improving the transparency and rigor of our empirical sections. We have revised the manuscript accordingly to address each point while preserving the core contributions of the Multi-PR GPA framework.

  1. Referee: [User Study] User Study section: Sample size, participant demographics, recruitment method, and any statistical tests (e.g., significance levels or effect sizes) are not reported despite the abstract's claims of positive results and bias mitigation. These omissions prevent evaluation of the reliability and generalizability of the reported effectiveness.

    Authors: We agree that these details should have been reported more explicitly. The revised User Study section now includes the sample size, full participant demographics, recruitment procedures (via online research platforms and institutional channels), and the statistical tests performed, including significance levels and effect sizes supporting the abstract claims. These additions directly improve evaluability of reliability and generalizability.
    Revision: yes

  2. Referee: [Results] Results section: The central claim of superior performance with multi-personality representations and partial bias mitigation lacks quantitative grounding such as correlation coefficients with validated instruments (NEO-PI-R or IPIP-NEO), inter-rater reliability, or ablation comparisons against single-personality baselines. Without these metrics, it remains unclear whether observed signals reflect user personality variance or consistent LLM response patterns.

    Authors: We acknowledge the need for stronger quantitative anchoring. The revised Results section now incorporates correlation coefficients with a validated instrument, inter-rater reliability statistics, and explicit ablation comparisons between multi-personality and single-personality conditions. These metrics demonstrate that the performance gains arise from capturing user personality variance rather than LLM artifacts alone, while the error-structure analysis quantifies the partial bias mitigation achieved through multi-context aggregation.
    Revision: yes

  3. Referee: [Methodology] Methodology section: The exact scoring algorithms used to derive Big Five trait scores from the collected multi-context textual data, including the implementation of multi-context aggregation, are not specified. This hinders reproducibility and assessment of how the framework operationalizes personality assessment.

    Authors: We agree that greater specificity is required for reproducibility. The revised Methodology section now details the exact scoring algorithms, including the prompt templates, trait extraction rules, and the multi-context aggregation procedure (weighted combination of scores across interaction contexts based on relevance and consistency). This makes the operationalization of the personality assessment fully transparent and replicable.
    Revision: yes
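
The third response describes multi-context aggregation as a weighted combination of scores across interaction contexts based on relevance and consistency. A minimal sketch of one way such a combination could look is given here. Everything in it is an assumption for illustration: the `ContextScore` type, the `aggregate_contexts` function, and the choice to multiply relevance by consistency to form the weight do not come from the paper, which may define the weights differently.

```python
from dataclasses import dataclass
from typing import Dict, List

TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")

@dataclass
class ContextScore:
    """Per-context trait estimates (hypothetical structure)."""
    context: str               # label for the interaction context, e.g. one agent's game
    scores: Dict[str, float]   # one score per Big Five trait
    relevance: float           # assumed weight factor, >= 0
    consistency: float         # assumed weight factor, >= 0

def aggregate_contexts(contexts: List[ContextScore]) -> Dict[str, float]:
    """Weighted mean of trait scores; weight = relevance * consistency (an assumption)."""
    totals = {trait: 0.0 for trait in TRAITS}
    weight_sum = 0.0
    for ctx in contexts:
        weight = ctx.relevance * ctx.consistency
        weight_sum += weight
        for trait in TRAITS:
            totals[trait] += weight * ctx.scores[trait]
    if weight_sum == 0:
        raise ValueError("all context weights are zero")
    return {trait: totals[trait] / weight_sum for trait in TRAITS}
```

With equal weights this reduces to a plain average across contexts. An average of this kind can cancel errors that differ from context to context, but it leaves untouched any bias shared by all contexts, which fits the report that the mitigation was only partial.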

Circularity Check

0 steps flagged

Empirical user study shows no circular derivations or self-referential reductions

The paper describes a gamified personality assessment framework (Multi-PR GPA) that uses LLM agents to elicit user responses in interactive games, followed by a user study to evaluate effectiveness against Big Five theory. All central claims rest on empirical outcomes from participant interactions and error analysis rather than any mathematical derivation, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems are invoked that reduce results to inputs by construction; the assessment is grounded in external study data collection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard Big Five personality model as an unexamined domain assumption and on the untested premise that LLM agents can faithfully embody distinct personalities. No free parameters or new invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: The Big Five personality theory provides a valid and sufficient basis for interpreting interaction data as personality traits.
    Invoked when the prototype is grounded in classic Big Five theory.
