arxiv: 2506.04975 · v2 · pith:37DIG5OOnew · submitted 2025-06-05 · 💻 cs.CY

Evaluating Chinese Large Language Models: The Influence of Persona Assignment on Stereotypes and Safeguards

keywords: llms, across, persona-driven, refusal, toxicity, behavior, chinese, model

Recent research has highlighted that assigning specific personas to large language models (LLMs) can significantly increase harmful content generation. However, limited attention has been given to persona-driven toxicity in non-Western contexts, particularly in Chinese-based LLMs. In this paper, we perform a large-scale, cross-model analysis of refusal behavior and persona-driven toxicity amplification across four Chinese LLMs, leveraging a comprehensive dataset of over 1,400,000 generated texts. We identify significant disparities in persona-driven refusal behavior, including systematic gender differences in refusal triggering across the evaluated Chinese LLMs. Furthermore, we provide quantitative evidence of persona-driven toxicity amplification with respect to model default baselines. We show that this amplification--whose magnitude varies substantially across models--is driven by interactions across several factors, involving persona conditioning, prompting strategy, target social group, and model-specific safety mechanisms. Leveraging model-specific regression analyses, we systematically characterize how persona categories, target social groups, and prompt templates independently and jointly shape both refusal behavior and output toxicity. As a complementary case study, we further explore an iterative, evaluator-guided mitigation strategy based on model feedback with an external LLM evaluator, demonstrating that highly toxic outputs can be substantially reduced without costly model retraining. Overall, our findings highlight the importance of culturally contextualized safety evaluations for Chinese-language LLMs and provide a structured framework for assessing persona-induced risks and exploratory mitigation strategies in LLM-generated content.
