Fairness Testing of Large Language Models in Role-Playing
Pith reviewed 2026-05-23 18:15 UTC · model grok-4.3
The pith
Testing shows ten LLMs produce between 7,579 and 16,963 biased responses each when asked to role-play social roles spanning 11 demographic attributes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using 33,000 role-specific questions built from 550 social roles that span 11 demographic attributes, evaluations of ten LLMs identify 107,580 biased responses, with each model producing between 7,579 and 16,963 such responses.
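The headline figures imply a few simple ratios. The Python sketch below works them out under assumptions the summary does not state: roles are spread evenly across attributes, questions are spread evenly across roles, and each model contributes exactly one scored response per question. If the paper samples each question more than once, the rates would need to be rescaled.

```python
# Counts taken from the core claim.
ROLES = 550
ATTRIBUTES = 11
QUESTIONS = 33_000
MODELS = 10
TOTAL_BIASED = 107_580
MIN_BIASED, MAX_BIASED = 7_579, 16_963

# Assumption: roles split evenly per attribute, questions evenly per role.
roles_per_attribute = ROLES // ATTRIBUTES
questions_per_role = QUESTIONS // ROLES

# Assumption: one scored response per question per model.
responses_per_model = QUESTIONS
total_responses = responses_per_model * MODELS

overall_rate = TOTAL_BIASED / total_responses
min_rate = MIN_BIASED / responses_per_model
max_rate = MAX_BIASED / responses_per_model
mean_biased_per_model = TOTAL_BIASED / MODELS

print(f"roles per attribute:   {roles_per_attribute}")
print(f"questions per role:    {questions_per_role}")
print(f"overall biased rate:   {overall_rate:.1%}")
print(f"per-model rate range:  {min_rate:.1%} to {max_rate:.1%}")
print(f"mean biased per model: {mean_biased_per_model:,.0f}")
```

Under these assumptions roughly a third of all responses are flagged, and the per-model range runs from under a quarter to just over half.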
What carries the argument
The 33,000 role-specific questions generated from 550 social roles across 11 demographic attributes, paired with rule-based and LLM-based bias identification validated by human review.
If this is right
- Role-playing prompts cause LLMs to produce biased answers tied to demographic identities.
- The bias appears across all ten models tested and across yes/no, multiple-choice, and open-ended question types.
- The released set of 33,000 questions and detection scripts can be reused to measure bias in new models.
- Bias rates vary by model but remain high enough in every case to affect applications that rely on role adoption.
Where Pith is reading between the lines
- Application builders who use role-playing features may need separate bias filters before deployment.
- The same question-generation approach could be applied to test bias in other interaction styles such as story continuation or advice-giving.
- If the bias detection methods under-count subtle cases, the actual prevalence could be even higher than reported.
Load-bearing premise
The rule-based and LLM-based methods correctly flag biased answers without missing subtle cases or incorrectly labeling neutral ones.
What would settle it
Independent human review of a random sample of the flagged responses that finds the true bias rate is less than half the reported figure.
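One way to run this check can be sketched as follows. The sketch assumes reviewers label a simple random sample of flagged responses as confirmed or not, and it uses a Wilson score interval to decide whether the confirmed fraction is credibly below one half. The function names, the 95% level, and the sample sizes are illustrative choices, not taken from the paper. Note that this only bounds the flagged portion; it says nothing about biased responses the detectors missed.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def below_threshold(confirmed: int, sample_size: int, threshold: float = 0.5) -> bool:
    """True if even the upper bound of the confirmed fraction is under the threshold."""
    _, upper = wilson_interval(confirmed, sample_size)
    return upper < threshold


# Hypothetical review outcomes on a sample of 400 flagged responses.
print(below_threshold(confirmed=150, sample_size=400))
print(below_threshold(confirmed=190, sample_size=400))
```

With 150 of 400 confirmed, the upper bound stays under one half, so the claim would be in trouble. With 190 of 400, the interval straddles one half and the sample does not settle the question.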
Original abstract
Large Language Models (LLMs) have become foundational in modern language-driven software applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we conduct an empirical study on fairness testing of LLMs in role-playing scenarios. To enable this testing, we use LLMs to generate 550 social roles spanning a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions that target various forms of bias. These questions, covering Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the test cases, we conduct extensive evaluations of 10 advanced LLMs. The evaluation reveal 107,580 biased responses across the studied LLMs, with individual models yielding between 7,579 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the dataset, along with all scripts and experimental results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical study on fairness testing of LLMs in role-playing scenarios. It generates 550 social roles across 11 demographic attributes, creating 33,000 role-specific questions in Yes/No, multiple-choice, and open-ended formats. These are used to evaluate 10 LLMs, identifying biased responses using rule-based and LLM-based strategies validated by human evaluation. The study reports 107,580 biased responses across the models (ranging from 7,579 to 16,963 per model) and releases the dataset, scripts, and results.
Significance. If the bias detection pipeline is reliable, the work demonstrates the prevalence of social biases in LLM role-playing and provides a substantial public dataset and evaluation framework for future fairness research in this area. The public release of the dataset and scripts strengthens the contribution by enabling reproducibility.
major comments (1)
- [Abstract and Evaluation Strategy] The bias identification relies on rule-based and LLM-based strategies 'rigorously validated through human evaluation,' but no quantitative details are provided on inter-annotator agreement, the size of the validation set, precision or recall on role-playing responses, or error analysis by bias type. Since the central counts (107,580 biased responses) are produced by this pipeline, even moderate error rates could substantially alter the reported prevalence.
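To make the size of this concern concrete, the sketch below applies the standard identities "true positives = flagged × precision" and "actual positives = true positives / recall" to the reported total. The precision and recall values are hypothetical, since the manuscript reports none, and the function name is illustrative.

```python
def corrected_count(flagged: int, precision: float, recall: float) -> float:
    """Estimate the actual number of biased responses from a flagged count.

    precision: fraction of flagged responses that are truly biased.
    recall:    fraction of truly biased responses that were flagged.
    """
    if not (0 < precision <= 1 and 0 < recall <= 1):
        raise ValueError("precision and recall must be in (0, 1]")
    true_positives = flagged * precision
    return true_positives / recall


FLAGGED = 107_580  # total reported in the abstract

# Hypothetical detector quality scenarios (not from the paper).
scenarios = [(1.0, 1.0), (0.9, 0.9), (0.9, 0.7), (0.7, 0.9)]
for precision, recall in scenarios:
    estimate = corrected_count(FLAGGED, precision, recall)
    change = estimate / FLAGGED - 1
    print(f"precision={precision:.1f} recall={recall:.1f} "
          f"estimate={estimate:,.0f} ({change:+.0%})")
```

Errors can push the estimate either way: low recall means the reported count understates prevalence, while low precision means it overstates it. When the two are equal they cancel in the total, although the set of flagged responses would still be partly wrong.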
minor comments (1)
- [Abstract] The sentence 'The evaluation reveal 107,580 biased responses' contains a grammatical error ('reveal' should be 'reveals').
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the evaluation strategy. We address the concern point by point below.
- Referee: [Abstract and Evaluation Strategy] The bias identification relies on rule-based and LLM-based strategies 'rigorously validated through human evaluation,' but no quantitative details are provided on inter-annotator agreement, the size of the validation set, precision or recall on role-playing responses, or error analysis by bias type. Since the central counts (107,580 biased responses) are produced by this pipeline, even moderate error rates could substantially alter the reported prevalence.
  Authors: We agree that the current manuscript lacks the requested quantitative details on the human validation of the bias identification pipeline. In the revised version we will add a dedicated subsection reporting: (1) the size of the human validation set, (2) inter-annotator agreement (e.g., Cohen’s or Fleiss’ kappa), (3) precision and recall of both the rule-based and LLM-based detectors measured against the human labels on role-playing responses, and (4) an error analysis stratified by bias type. These additions will directly address the concern that moderate error rates could affect the reported prevalence figures.
  Revision: yes
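The promised metrics are straightforward to compute once human labels exist. The sketch below is a minimal pure-Python illustration of Cohen's kappa for two annotators, plus detector precision and recall against human labels. The label vectors are made-up toy data (1 = biased, 0 = not biased) and the function names are illustrative; none of this reflects the authors' actual scripts.

```python
from collections import Counter


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall(predicted: list[int], gold: list[int]) -> tuple[float, float]:
    """Precision and recall of the positive class (1) against gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must be equal in length")
    tp = sum(p == 1 and g == 1 for p, g in zip(predicted, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(predicted, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data for illustration only.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0]
detector = [1, 1, 1, 0, 1, 0, 0, 0]

print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
print("precision, recall =", precision_recall(detector, annotator_1))
```

For more than two annotators, Fleiss' kappa would replace the pairwise statistic, and the same precision and recall computation could be repeated per bias type for the stratified error analysis.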
Circularity Check
No circularity: purely empirical count of detector-flagged responses
The paper performs an empirical measurement: it generates 33,000 role-specific questions via LLMs, applies rule-based plus LLM-based bias detectors (human-validated per the abstract), and reports the resulting counts (107,580 total biased responses). No equations, fitted parameters, predictions, or derivations are present. The central claim is a direct tally of detector outputs on the generated test cases; it does not reduce to any self-definition, self-citation chain, or renaming of inputs. The quality of the detectors is a separate validity concern, not a circularity issue under the defined criteria.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Biased responses in role-playing can be reliably identified by a combination of rule-based and LLM-based strategies that were validated by human evaluation.
Forward citations
Cited by 3 Pith papers
- StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
  StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
- StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
  StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
- Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study
  Transformer models detect applicant gender in de-gendered academic recommendation letters via implicit linguistic patterns such as associations with words like 'emotional' and 'humanitarian', and removing these cues r...