
arxiv: 2411.00585 · v2 · submitted 2024-11-01 · 💻 cs.CY · cs.AI

Fairness Testing of Large Language Models in Role-Playing

Pith reviewed 2026-05-23 18:15 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords fairness testing · large language models · role-playing · social bias · demographic attributes · empirical evaluation · bias detection · LLM evaluation

The pith

Ten LLMs each produce between 7,579 and 16,963 biased responses when asked to play social roles spanning 11 demographic attributes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether social biases appear when LLMs are prompted to adopt specific social roles. It generates 550 roles covering 11 demographic attributes and turns them into 33,000 questions in yes/no, multiple-choice, and open formats. These questions are fed to ten advanced models, and biased outputs are flagged with rule-based and LLM-based detectors that were checked by humans. The results show more than 100,000 biased responses overall, indicating that role-playing prompts reliably surface demographic biases. This matters because role-playing is a common way to make LLMs useful in real applications.

Core claim

Using 33,000 role-specific questions built from 550 social roles that span 11 demographic attributes, evaluations of ten LLMs identify 107,580 biased responses, with each model producing between 7,579 and 16,963 such responses.
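
The headline figures can be put in proportion with a few lines of arithmetic. The sketch below uses only numbers quoted in this summary (550 roles, 11 attributes, 33,000 questions, ten models, 107,580 biased responses, and the per-model range of 7,579 to 16,963). It assumes, purely for illustration, that each model contributes one counted response per question; the summary does not say how repeated queries are aggregated, so the resulting shares are rough proportions rather than the paper's own metric.

```python
# Back-of-the-envelope proportions from the figures quoted in the summary.
# Assumption (not stated in the summary): one counted response per question per model.

ROLES = 550
ATTRIBUTES = 11
QUESTIONS = 33_000
MODELS = 10
TOTAL_BIASED = 107_580
MIN_BIASED, MAX_BIASED = 7_579, 16_963

# Averages implied by the totals; the actual split per attribute or role may differ.
roles_per_attribute = ROLES // ATTRIBUTES
questions_per_role = QUESTIONS // ROLES


def share(biased: int, asked: int) -> float:
    """Fraction of asked questions that drew a biased response."""
    return biased / asked


overall = share(TOTAL_BIASED, QUESTIONS * MODELS)
lowest = share(MIN_BIASED, QUESTIONS)
highest = share(MAX_BIASED, QUESTIONS)

print(f"roles per attribute (average): {roles_per_attribute}")
print(f"questions per role (average):  {questions_per_role}")
print(f"overall biased share:          {overall:.1%}")
print(f"per-model range:               {lowest:.1%} to {highest:.1%}")
```

Under that one-response-per-question assumption, the totals work out to an average of 50 roles per attribute and 60 questions per role, with roughly a third of all question-model pairs flagged as biased.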

What carries the argument

The 33,000 role-specific questions generated from 550 social roles across 11 demographic attributes, paired with rule-based and LLM-based bias identification validated by human review.

If this is right

  • Role-playing prompts cause LLMs to produce biased answers tied to demographic identities.
  • The bias appears across all ten models tested and across yes/no, multiple-choice, and open-ended question types.
  • The released set of 33,000 questions and detection scripts can be reused to measure bias in new models.
  • Bias rates vary by model but remain high enough in every case to affect applications that rely on role adoption.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Application builders who use role-playing features may need separate bias filters before deployment.
  • The same question-generation approach could be applied to test bias in other interaction styles such as story continuation or advice-giving.
  • If the bias detection methods under-count subtle cases, the actual prevalence could be even higher than reported.

Load-bearing premise

The rule-based and LLM-based methods correctly flag biased answers without missing subtle cases or incorrectly labeling neutral ones.

What would settle it

Independent human review of a random sample of the flagged responses that finds the true bias rate is less than half the reported figure.
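
A check of this kind could be read with a standard interval estimate. The following is a minimal sketch, not a procedure from the paper: `wilson_interval` and `refutes_reported_count` are hypothetical helpers, the first computing the Wilson score interval for the share of sampled flagged responses that reviewers confirm as biased, and the sample counts in the usage lines are invented for illustration.

```python
import math


def wilson_interval(confirmed: int, sampled: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95% confidence)."""
    if sampled <= 0 or not 0 <= confirmed <= sampled:
        raise ValueError("need sampled > 0 and 0 <= confirmed <= sampled")
    p = confirmed / sampled
    denom = 1 + z**2 / sampled
    centre = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def refutes_reported_count(confirmed: int, sampled: int) -> bool:
    """True if even the upper bound on the confirmed share is below one half."""
    _, upper = wilson_interval(confirmed, sampled)
    return upper < 0.5


# Illustrative numbers only: 400 flagged responses sampled, reviewers confirm 352.
low, high = wilson_interval(352, 400)
print(f"confirmed share: {352 / 400:.1%}, interval: [{low:.3f}, {high:.3f}]")
print("reported count refuted:", refutes_reported_count(352, 400))
```

Comparing the upper bound, rather than the point estimate, against one half keeps a small or noisy sample from being read as a refutation.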

Figures

Figures reproduced from arXiv: 2411.00585 by Jie M. Zhang, Tianlin Li, Weisong Sun, Xinyue Li, Xuanzhe Liu, Yang Liu, Yiling Lou, Ying Xiao, Zhenpeng Chen.

Figure 1: Examples of biased responses from GPT4o-mini and Llama3-70b during role-playing. Each question …
Figure 2: Overview of BiasLens. Tizpaz-Niari et al. [51] present a testing approach that examines how hyperparameter configurations in machine learning models impact fairness outcomes. Building on gradient-based techniques, Zhang et al. [61] develop ADF to efficiently generate test cases that expose fairness violations. Taking a different approach, Zheng et al. [66] propose NeuronFair, which identifies and leverages…
Figure 3: Example prompt for role generation related to the occupation attribute.
Figure 4: Prompts for question generation. In summary, for Yes/No questions, the prompt consists of: Task description + Example (Yes/No) + Requirements + Format (Yes/No). For Choice questions, the prompt consists of: Task description + Example (Choice) + Requirements + Format (Choice). For Why questions, the prompt consists of: Task description + Example (Why) + Requirements + Format (Why). The complete prompts for …
Figure 5: Example response generated by Llama-3-8B to a Why question on September 29, 2024.
Figure 6: Prompt for three judge LLMs. … each one. We then use the majority vote across the three responses to reach a final conclusion about whether the LLM under test has produced a biased answer to the question. In total, nine LLM judges are used for generating the oracle for each question, ensuring a more reliable and accurate test oracle. For the evaluation, we use GPT4o-mini [9] due to its moderate cost, making …
Figure 7: (RQ1) Average biased responses per demographic attribute across six LLMs. The attributes are presented …
Figure 8: (RQ1) Proportion of questions that elicit biased responses in one to six LLMs. Overall, the moderate …
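
The text accompanying Figure 6 mentions a majority vote across three responses and nine judge decisions per question. The excerpt is truncated, so the exact aggregation is not recoverable here; the sketch below assumes three judges label each of three responses, takes a majority per response, and then a majority across responses. The function names and the example labels are hypothetical.

```python
from collections import Counter


def majority(votes: list[bool]) -> bool:
    """Return the label held by more than half of an odd number of votes."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd, non-zero number of votes to avoid ties")
    return Counter(votes).most_common(1)[0][0]


def oracle_verdict(judgements: list[list[bool]]) -> bool:
    """judgements[i][j] is judge j's 'biased?' label for response i (assumed layout)."""
    per_response = [majority(judge_votes) for judge_votes in judgements]
    return majority(per_response)


# Three responses, three judges each: nine labels in total (invented values).
example = [
    [True, True, False],   # response 1: judged biased
    [False, False, True],  # response 2: judged not biased
    [True, False, True],   # response 3: judged biased
]
print("biased:", oracle_verdict(example))
```

Requiring an odd number of votes at both levels means the verdict is always defined without a tie-breaking rule.
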
Original abstract

Large Language Models (LLMs) have become foundational in modern language-driven software applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we conduct an empirical study on fairness testing of LLMs in role-playing scenarios. To enable this testing, we use LLMs to generate 550 social roles spanning a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions that target various forms of bias. These questions, covering Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the test cases, we conduct extensive evaluations of 10 advanced LLMs. The evaluation reveal 107,580 biased responses across the studied LLMs, with individual models yielding between 7,579 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the dataset, along with all scripts and experimental results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper conducts an empirical study on fairness testing of LLMs in role-playing scenarios. It generates 550 social roles across 11 demographic attributes, creating 33,000 role-specific questions in Yes/No, multiple-choice, and open-ended formats. These are used to evaluate 10 LLMs, identifying biased responses using rule-based and LLM-based strategies validated by human evaluation. The study reports 107,580 biased responses across the models (ranging from 7,579 to 16,963 per model) and releases the dataset, scripts, and results.

Significance. If the bias detection pipeline is reliable, the work demonstrates the prevalence of social biases in LLM role-playing and provides a substantial public dataset and evaluation framework for future fairness research in this area. The public release of the dataset and scripts strengthens the contribution by enabling reproducibility.

major comments (1)
  1. [Abstract and Evaluation Strategy] The bias identification relies on rule-based and LLM-based strategies 'rigorously validated through human evaluation,' but no quantitative details are provided on inter-annotator agreement, the size of the validation set, precision or recall on role-playing responses, or error analysis by bias type. Since the central counts (107,580 biased responses) are produced by this pipeline, even moderate error rates could substantially alter the reported prevalence.
minor comments (1)
  1. [Abstract] The sentence 'The evaluation reveal 107,580 biased responses' contains a grammatical error ('reveal' should be 'reveals').
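
How much detector error could move the headline number can be illustrated with the Rogan-Gladen prevalence correction, a standard adjustment for imperfect classifiers. This is a sketch under stated assumptions, not an analysis from the paper: the apparent share of 0.326 is 107,580 divided by 33,000 × 10, assuming one counted response per question per model, and the sensitivity and specificity values are invented because the summary reports none.

```python
def corrected_prevalence(apparent: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen estimate of true prevalence given detector sensitivity and specificity."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("detector must perform better than chance")
    estimate = (apparent + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clamp to a valid proportion


APPARENT = 107_580 / (33_000 * 10)  # assumes one counted response per question per model

# Hypothetical detector qualities; none of these values come from the paper.
for sens, spec in [(1.00, 1.00), (0.90, 0.90), (0.95, 0.80)]:
    adjusted = corrected_prevalence(APPARENT, sens, spec)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f} -> adjusted share {adjusted:.3f}")
```

With a perfect detector the adjusted share equals the apparent one; lowering specificity pulls the estimate down quickly, which is the effect the major comment is concerned about.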

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the evaluation strategy. We address the concern point by point below.

Point-by-point responses
  1. Referee: [Abstract and Evaluation Strategy] The bias identification relies on rule-based and LLM-based strategies 'rigorously validated through human evaluation,' but no quantitative details are provided on inter-annotator agreement, the size of the validation set, precision or recall on role-playing responses, or error analysis by bias type. Since the central counts (107,580 biased responses) are produced by this pipeline, even moderate error rates could substantially alter the reported prevalence.

    Authors: We agree that the current manuscript lacks the requested quantitative details on the human validation of the bias identification pipeline. In the revised version we will add a dedicated subsection reporting: (1) the size of the human validation set, (2) inter-annotator agreement (e.g., Cohen’s or Fleiss’ kappa), (3) precision and recall of both the rule-based and LLM-based detectors measured against the human labels on role-playing responses, and (4) an error analysis stratified by bias type. These additions will directly address the concern that moderate error rates could affect the reported prevalence figures. revision: yes
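
The statistics promised in this response are simple to compute once human labels exist. The sketch below is an illustration only: `cohens_kappa` and `precision_recall` are hypothetical helpers written against the standard definitions, and the label lists are invented, not drawn from the paper's validation data.

```python
from collections import Counter


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two annotators giving binary labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in (True, False)) / n**2
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def precision_recall(predicted: list[bool], gold: list[bool]) -> tuple[float, float]:
    """Precision and recall of detector flags against human (gold) labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels for eight responses (True = biased).
annotator_a = [True, True, False, False, True, False, True, False]
annotator_b = [True, True, False, True, True, False, False, False]

print("kappa:", cohens_kappa(annotator_a, annotator_b))
print("precision, recall:", precision_recall(annotator_b, annotator_a))
```

Fleiss' kappa, which the response also mentions, generalises the same idea to more than two annotators and would need a separate implementation.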

Circularity Check

0 steps flagged

No circularity: purely empirical count of detector-flagged responses

Full rationale

The paper performs an empirical measurement: it generates 33,000 role-specific questions via LLMs, applies rule-based plus LLM-based bias detectors (human-validated per the abstract), and reports the resulting counts (107,580 biased responses in total). No equations, fitted parameters, predictions, or derivations are present. The central claim is a direct tally of detector outputs on the generated test cases; it does not reduce to any self-definition, self-citation chain, or renaming of inputs. The quality of the detectors is a separate validity concern, not a circularity issue under the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical measurement study; it introduces no mathematical free parameters, no new physical or logical entities, and relies on one domain-level assumption about bias detection.

axioms (1)
  • domain assumption Biased responses in role-playing can be reliably identified by a combination of rule-based and LLM-based strategies that were validated by human evaluation.
    This assumption is required to convert raw model outputs into the reported count of 107,580 biased responses.

pith-pipeline@v0.9.0 · 5798 in / 1210 out tokens · 50680 ms · 2026-05-23T18:15:46.192012+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 unverdicted novelty 7.0

    StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.

  2. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 accept novelty 7.0

    StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

  3. Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

    cs.LG 2026-04 unverdicted novelty 5.0

    Transformer models detect applicant gender in de-gendered academic recommendation letters via implicit linguistic patterns such as associations with words like 'emotional' and 'humanitarian', and removing these cues r...

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 2 Pith papers · 1 internal anchor

  1. [1]

    Adopt-a-persona-claude

    2024. Adopt-a-persona-claude. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts

  2. [2]

    Adopt-a-persona-gemini

    2024. Adopt-a-persona-gemini. https://support.google.com/a/users/answer/14667148?visit_id=638649091395709697-2537054327&hl=en&rd=1

  3. [3]

    Adopt-a-persona-meta

    2024. Adopt-a-persona-meta. https://www.llama.com/docs/how-to-guides/prompting

  4. [4]

    Adopt-a-persona-mistral

    2024. Adopt-a-persona-mistral. https://docs.mistral.ai/guides/prompting_capabilities/

  5. [5]

    Adopt-a-persona-openai

    2024. Adopt-a-persona-openai. https://platform.openai.com/docs/guides/prompt-engineering/tactic-ask-the-model-to-adopt-a-persona

  6. [6]

    Chatbot Arena LLM Leaderboard: Community-driven evaluation for best LLM and AI chatbots

    2024. Chatbot Arena LLM Leaderboard: Community-driven evaluation for best LLM and AI chatbots. https://lmarena.ai/

  7. [7]

    DeepSeek-V2.5

    2024. DeepSeek-V2.5. https://huggingface.co/deepseek-ai/DeepSeek-V2.5

  8. [8]

    2024. GPT4o. https://platform.openai.com/docs/models/gpt-4o

  9. [9]

    GPT4o-mini

    2024. GPT4o-mini. https://platform.openai.com/docs/models/gpt-4o-mini

  10. [10]

    Meta-Llama-3-70B

    2024. Meta-Llama-3-70B. https://huggingface.co/meta-llama/Meta-Llama-3-70B

  11. [11]

    Meta-Llama-3-8B

    2024. Meta-Llama-3-8B. https://huggingface.co/meta-llama/Meta-Llama-3-8B

  12. [12]

    Mistral-7B-Instruct-v0.3

    2024. Mistral-7B-Instruct-v0.3. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

  13. [13]

    Qwen1.5-110B-Chat

    2024. Qwen1.5-110B-Chat. https://huggingface.co/Qwen/Qwen1.5-110B-Chat

  14. [14]

    Replication package

    2024. Replication package. https://github.com/LLMBias/BiasLens

  15. [15]

    Muhammad Hilmi Asyrofi, Zhou Yang, Imam Nur Bani Yusuf, Hong Jin Kang, Ferdian Thung, and David Lo. 2022. BiasFinder: Metamorphic test generation to uncover bias for sentiment analysis systems. IEEE Transactions on Software Engineering 48, 12 (2022), 5087–5101

  16. [16]

    The oracle problem in software testing: A survey

    Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering 41, 5 (2015), 507–525

  17. [17]

    Yuriy Brun and Alexandra Meliou. 2018. Software fairness. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018. 754–759

  18. [18]

    Deborah Carlander, Kiyoshiro Okada, Henrik Engström, and Shuichi Kurabayashi. 2024. Controlled chain of thought: Eliciting role-play understanding in LLM through prompts. In Proceedings of IEEE Conference on Games, CoG 2024. 1–4

  19. [19]

    Zhenpeng Chen, Yanbin Cao, Yuanqiang Liu, Haoyu Wang, Tao Xie, and Xuanzhe Liu. 2020. A comprehensive study on challenges in deploying deep learning based software. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020. 750–762

  20. [20]

    Fairness testing: A comprehensive survey and analysis of trends

    Zhenpeng Chen, Jie M. Zhang, Max Hort, Mark Harman, and Federica Sarro. 2024. Fairness testing: A comprehensive survey and analysis of trends. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 137:1–137:59

  21. [21]

    MAAT: A novel ensemble approach to addressing fairness and performance bugs for machine learning software

    Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman. 2022. MAAT: A novel ensemble approach to addressing fairness and performance bugs for machine learning software. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022. 1122–1134

  22. [22]

    A comprehensive empirical study of bias mitigation methods for machine learning classifiers

    Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman. 2023. A comprehensive empirical study of bias mitigation methods for machine learning classifiers. ACM Transactions on Software Engineering and Methodology 32, 4 (2023), 106:1–106:30

  23. [23]

    Chatbot arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: An open platform for evaluating LLMs by human preference. In Proceedings of the Forty-first International Conference on Machine Learning, ICML 2024

  24. [24]

    Zhibo Chu, Zichong Wang, and Wenbin Zhang. 2024. Fairness in large language models: A Taxonomic Survey. SIGKDD Exploration 26, 1 (2024), 34–48

  25. [25]

    Xuanqi Gao, Juan Zhai, Shiqing Ma, Chao Shen, Yufei Chen, and Qian Wang. 2022. Fairneuron: Improving deep neural network fairness with adversary games on selective neurons. In Proceedings of the 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022. 921–933

  26. [26]

    Karen Gonsalkorale, Jeffrey W Sherman, and Karl Christoph Klauer. 2009. Aging and prejudice: Diminished regulation of automatic race bias among older adults. Journal of Experimental Social Psychology 45, 2 (2009), 410–414

  27. [27]

    James D Gwartney and Kenneth M McCaffree. 1971. Variance in discrimination among occupations. Southern Economic Journal (1971), 141–155

  28. [28]

    Amit Haim, Alejandro Salinas, and Julian Nyarko. 2024. What’s in a name? Auditing large language models for race and gender bias. arXiv preprint arXiv:2402.14875 (2024)

  29. [29]

    Jaeho Jeon and Seongyong Lee. 2023. Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies 28, 12 (2023), 15873–15892

  30. [30]

    A Woman is More Culturally Knowledgeable than A Man?

    Mahammed Kamruzzaman, Hieu Nguyen, Nazmul Hassan, and Gene Louis Kim. 2024. "A Woman is More Culturally Knowledgeable than A Man?": The Effect of Personas on Cultural Norm Interpretation in LLMs. CoRR abs/2409.11636 (2024)

  31. [31]

    Hadas Kotek, Rikker Dockum, and David Q. Sun. 2023. Gender bias and stereotypes in large language models. In Proceedings of The ACM Collective Intelligence Conference, CI 2023. 12–24

  32. [32]

    J Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174

  33. [33]

    Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2023. A survey on fairness in large language models. CoRR abs/2308.10149 (2023)

  34. [34]

    Ryan Louie, Ananjan Nandi, William Fang, Cheng Chang, Emma Brunskill, and Diyi Yang. 2024. Roleplay-doh: Enabling domain-experts to create LLM-simulated patients via eliciting and adhering to principles. CoRR abs/2407.00870 (2024)

  35. [35]

    Li-Chun Lu, Shou-Jen Chen, Tsung-Min Pai, Chan-Hung Yu, Hung-yi Lee, and Shao-Hua Sun. 2024. LLM discussion: Enhancing the creativity of large language models via discussion framework and role-play. CoRR abs/2405.06373 (2024)

  36. [36]

    Verya Monjezi, Ashutosh Trivedi, Gang Tan, and Saeid Tizpaz-Niari. 2023. Information-Theoretic Testing and Debugging of Fairness Defects in Deep Neural Networks. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023. 1571–1582

  37. [37]

    Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. Association for Computational Linguistics, 5356–5371

  38. [38]

    Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2024. An empirical study of the non-determinism of ChatGPT in code generation. ACM Transactions on Software Engineering and Methodology (2024)

  39. [39]

    I’m fully who I am

    Anaelia Ovalle, Palash Goyal, Jwala Dhamala, Zachary Jaggers, Kai-Wei Chang, Aram Galstyan, Richard S. Zemel, and Rahul Gupta. 2023. “I’m fully who I am”: Towards centering transgender and non-binary voices to measure biases in open language generation. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2023. 1246–1266

  40. [40]

    Shubham Pandey, Archana Patel, and Purvi Pokhariyal. 2024. Exploring the role of ChatGPT in the law enforcement and banking sectors. Artificial Intelligence for Risk Mitigation in the Financial Industry (2024), 327–347

  41. [41]

    Juliane Ressel, Michaele Völler, Finbarr Murphy, and Martin Mullins. 2024. Addressing the notion of trust around ChatGPT in the high-stakes use case of insurance. Technology in Society (2024), 102644

  42. [42]

    Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. 2023. In-context impersonation reveals large language models’ strengths and biases. In Proceedings of Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023

  43. [43]

    I’m not Racist but

    Abel Salinas, Louis Penafiel, Robert McCormack, and Fred Morstatter. 2023. “I’m not racist but...”: Discovering bias in the internal knowledge of large language models. CoRR abs/2310.08780 (2023)

  44. [44]

    Social Bias Frames: Reasoning about Social and Power Implications of Language

    Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social Bias Frames: Reasoning about Social and Power Implications of Language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 5477–5490

  45. [45]

    Burcu Sayin, Pasquale Minervini, Jacopo Staiano, and Andrea Passerini. 2024. Can LLMs correct physicians, yet? Investigating effective interaction methods in the medical domain. In Proceedings of the 6th Clinical Natural Language Processing Workshop, Clinical NLP 2024. 218–237

  46. [46]

    Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role play with large language models. Nature 623, 7987 (2023), 493–498

  47. [47]

    I’m sorry to hear that

    Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. 2022. “I’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022. 9180–9211

  48. [48]

    Astraea: Grammar-based fairness testing

    Ezekiel O. Soremekun, Sakshi Udeshi, and Sudipta Chattopadhyay. 2022. Astraea: Grammar-based fairness testing. IEEE Transactions on Software Engineering 48, 12 (2022), 5188–5211

  49. [49]

    Zeyu Sun, Zhenpeng Chen, Jie Zhang, and Dan Hao. 2024. Fairness testing of machine translation systems. ACM Transactions on Software Engineering and Methodology 33, 6 (2024), 156

  50. [50]

    Yan Tao, Olga Viberg, Ryan S Baker, and René F Kizilcec. 2024. Cultural bias and cultural alignment of large language models. PNAS nexus 3, 9 (2024)

  51. [51]

    Saeid Tizpaz-Niari, Ashish Kumar, Gang Tan, and Ashutosh Trivedi. 2022. Fairness-aware configuration of machine learning libraries. In Proceedings of the 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022. 909–920

  52. [52]

    Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen

  53. [53]

    Two tales of persona in LLMs: A survey of role-playing and personalization

    Two tales of persona in LLMs: A survey of role-playing and personalization. CoRR abs/2406.01171 (2024)

  54. [54]

    Kelly is a warm person, Joseph is a role model

    Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, and Nanyun Peng. 2023. “Kelly is a warm person, Joseph is a role model”: Gender biases in LLM-generated reference letters. In Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2023. 3730–3748

  55. [55]

    Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu. 2023. BiasAsker: Measuring the bias in conversational AI system. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023. 515–527

  56. [56]

    Chao Wang, Zhenpeng Chen, and Minghui Zhou. 2023. AutoML from software engineering perspective: Landscapes and challenges. In Proceedings of the 20th IEEE/ACM International Conference on Mining Software Repositories, MSR

  57. [57]

    Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and Michael R. Lyu. 2024. Not all countries celebrate Thanksgiving: On the cultural dominance in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024. 6349–6384

  58. [58]

    Craig S Webster, Saana Taylor, Courtney Thomas, and Jennifer M Weller. 2022. Social bias, discrimination and inequity in healthcare: Mechanisms, implications and recommendations. BJA education 22, 4 (2022), 131–137

  59. [59]

    Jinfeng Wen, Zhenpeng Chen, Yi Liu, Yiling Lou, Yun Ma, Gang Huang, Xin Jin, and Xuanzhe Liu. 2021. An empirical study on challenges of application development in serverless computing. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021. 416–428

  60. [60]

    Cristina G Wilson, Amy T Nusbaum, Paul Whitney, and John M Hinson. 2018. Age-differences in cognitive flexibility when overcoming a preexisting bias through feedback. Journal of clinical and experimental neuropsychology 40, 6 (2018), 586–594

  61. [61]

    Machine learning testing: Survey, landscapes and horizons

    Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2022. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering 48, 2 (2022), 1–36

  62. [62]

    Peixin Zhang, Jingyi Wang, Jun Sun, Xinyu Wang, Guoliang Dong, Xingen Wang, Ting Dai, and Jin Song Dong. 2022. Automatic Fairness Testing of Neural Classifiers Through Adversarial Sampling. IEEE Trans. Software Eng. 48 (2022)

  63. [63]

    An empirical study of common challenges in developing deep learning applications

    Tianyi Zhang, Cuiyun Gao, Lei Ma, Michael R. Lyu, and Miryung Kim. 2019. An empirical study of common challenges in developing deep learning applications. In Proceedings of the 30th IEEE International Symposium on Software Reliability Engineering, ISSRE 2019. 104–115

  64. [64]

    Huaqin Zhao, Zhengliang Liu, Zihao Wu, Yiwei Li, Tianze Yang, Peng Shu, Shaochen Xu, Haixing Dai, Lin Zhao, Gengchen Mai, Ninghao Liu, and Tianming Liu. 2024. Revolutionizing finance with LLMs: An overview of applications and insights. CoRR abs/2401.11641 (2024)

  65. [65]

    Jinman Zhao, Zifan Qian, Linbo Cao, Yining Wang, and Yitian Ding. 2024. Bias and Toxicity in Role-Play Reasoning. CoRR abs/2409.13979 (2024)

  66. [66]

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. CoRR abs/2303.18223 (2023)

  67. [67]

    Haibin Zheng, Zhiqing Chen, Tianyu Du, Xuhong Zhang, Yao Cheng, Shouling Ji, Jingyi Wang, Yue Yu, and Jinyin Chen. 2022. NeuronFair: Interpretable White-Box Fairness Testing through Biased Neuron Identification. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022