Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

Kholoud K. Aldous; Wajdi Zaghouani; Yicheng Gao

arxiv: 2605.29667 · v1 · pith:NGZBTRXRnew · submitted 2026-05-28 · 💻 cs.CL


Pith reviewed 2026-06-29 07:53 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM safety · Chinese adversarial prompts · obfuscation taxonomy · human annotation · high-stakes domains · benchmark dataset · safety alignment · evasion techniques

The pith

A human-annotated benchmark of 1897 Chinese adversarial prompts shows that English safety alignments do not transfer to Chinese-language settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ChiSafe-PAS as a collection of 1897 adversarial Chinese prompts across four high-stakes domains, with 1544 entries carrying full human annotations for response categories, obfuscation types, risk levels, and rationales. It establishes that safety systems tuned on English data miss Chinese-specific evasion methods such as Pinyin romanization, character decomposition, slang, and hedging. A reader would care because these gaps leave models open to generating harmful content in real-world Chinese use cases involving self-harm, fraud, and similar risks. The annotations supply concrete labels to measure whether models refuse, redirect, or respond to such prompts. The work treats the dataset as a practical tool for testing alignment rather than relying on scale alone.

Core claim

The paper claims that a dedicated benchmark of adversarial Chinese prompts with human gold-standard labels for response type, obfuscation category, and risk level is required to evaluate LLM safety, because English-trained systems break down on language-specific evasion techniques including Pinyin romanization, character decomposition, internet slang, and hedging tone.

What carries the argument

The ChiSafe-PAS dataset, which supplies 1,897 prompts, 1,544 of them fully annotated with a three-class response label, a nine-category obfuscation type, a risk rating, and an annotator rationale.
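A minimal sketch of how one annotated entry might be represented, for readers who want to load the benchmark into code. The three response labels and the four domains come from the abstract; the class name `Entry`, the field names, and the use of plain strings for the obfuscation category and risk level are assumptions, since this page does not list the nine categories or the rating scale.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ResponseLabel(Enum):
    # The three response classes named in the abstract.
    REFUSE = "REFUSE"
    SAFE_REDIRECT = "SAFE-REDIRECT"
    RESPOND = "RESPOND"


# The four domains named in the abstract; exact string spellings are assumed.
DOMAINS = ("self-harm and violence", "drug and illicit trade", "fraud", "satire")


@dataclass(frozen=True)
class Entry:
    """Hypothetical record layout; the released file format may differ."""
    prompt: str
    domain: str
    response_label: Optional[ResponseLabel] = None
    obfuscation: Optional[str] = None  # one of nine categories (list not given here)
    risk_level: Optional[str] = None   # rating scale not specified on this page
    rationale: Optional[str] = None

    def is_fully_annotated(self) -> bool:
        # True only when all four annotation fields are present.
        fields = (self.response_label, self.obfuscation, self.risk_level, self.rationale)
        return all(value is not None for value in fields)
```

Under this layout, filtering a loaded list with `is_fully_annotated()` would separate the fully labeled subset from the remaining prompts.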

If this is right

Safety evaluations using this benchmark will report lower performance for Chinese inputs than for English inputs on the same risk categories.
The nine-category obfuscation taxonomy supplies a structured way to measure and improve detection of Chinese-specific evasion methods.
Models can be compared consistently on refusal, safe-redirect, and full-response behavior within the four domains of self-harm, illicit trade, fraud, and satire.
Annotation rationales provide traceable evidence for why certain prompts should trigger refusal rather than response.
The dataset supports direct measurement of the gap between English safety training and actual behavior in Chinese contexts.
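The per-domain comparison described in the list above could be computed roughly as follows. This is a sketch under stated assumptions: the function name `behavior_rates` and the `(domain, gold_label, model_label)` tuple layout are hypothetical, and how a model's free-text output gets mapped to one of the three labels is left outside the example.

```python
from collections import Counter, defaultdict

LABELS = ("REFUSE", "SAFE-REDIRECT", "RESPOND")


def behavior_rates(records):
    """Summarise model behavior per domain.

    records: iterable of (domain, gold_label, model_label) tuples,
    where both labels are drawn from LABELS.
    """
    counts = defaultdict(Counter)  # domain -> count of each model label
    matches = Counter()            # domain -> number of gold/model agreements
    for domain, gold, model in records:
        counts[domain][model] += 1
        matches[domain] += int(gold == model)

    report = {}
    for domain, label_counts in counts.items():
        n = sum(label_counts.values())
        report[domain] = {
            "n": n,
            "rates": {label: label_counts[label] / n for label in LABELS},
            "gold_match": matches[domain] / n,
        }
    return report
```

Reporting the full three-way distribution alongside the gold-match rate keeps over-refusal (REFUSE where the gold label is RESPOND) visible, which a single refusal rate would hide.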

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same annotation approach could be applied to additional languages that use non-Latin scripts or romanization to create parallel benchmarks.
Training data drawn from the obfuscation categories might reduce the need for post-hoc filtering in deployed Chinese systems.
The distinction between training and evaluation data becomes harder to maintain once such annotated adversarial sets exist.
Risk-level ratings allow experiments that test whether higher-rated prompts produce measurably different model outputs across domains.

Load-bearing premise

The human annotators' labels accurately identify genuine Chinese evasion techniques and the correct safety responses for each prompt.

What would settle it

A controlled test in which multiple LLMs achieve the same refusal rates on these Chinese prompts as on matched English prompts would show that the benchmark does not isolate unique language failures.
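One way to check whether two refusal rates differ is a two-proportion z-test, sketched below with the standard library only. The function name and arguments are illustrative, and the test itself is an assumption rather than anything the paper prescribes. It treats the two prompt sets as independent samples; for strictly matched prompt pairs, a paired test such as McNemar's would be the more appropriate choice.

```python
import math


def two_proportion_z(refused_a, n_a, refused_b, n_b):
    """Two-sided z-test for a difference between two refusal rates.

    refused_a / n_a: refusals and total prompts in set A (e.g. Chinese).
    refused_b / n_b: refusals and total prompts in set B (e.g. English).
    Returns (z, p_value).
    """
    p_a = refused_a / n_a
    p_b = refused_b / n_b
    pooled = (refused_a + refused_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both sets are all-refuse or all-respond: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A large p-value across several models would be consistent with the "same refusal rates" outcome described above, although failing to find a difference is weaker evidence than a properly powered equivalence test.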

Original abstract

When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural bound-aries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New Chinese adversarial safety benchmark with a nine-category obfuscation taxonomy, but no inter-annotator agreement numbers to support the gold-standard labels.

This paper introduces ChiSafe-PAS, a set of 1,897 human-annotated Chinese adversarial prompts across self-harm, drugs, fraud, and satire, with 1,544 carrying full labels for response type, risk level, and a nine-category obfuscation taxonomy that includes Pinyin, character decomposition, slang, and hedging.

The new element is the taxonomy and the focus on Chinese-specific evasion methods that English safety alignments often miss. The authors lay out the domains and annotation scheme clearly and keep the goal practical: give people a resource for testing models in real Chinese deployments.

The soft spot is the missing validation for those annotations. The text describes the process and categories but supplies no inter-annotator agreement figures, no count of annotators per item, and no adjudication details. For culturally loaded labels like tone or slang, that leaves the reliability of the gold standard open. Prompt collection methods also lack enough detail to assess selection bias.

The work is aimed at researchers who build or evaluate safety systems for Chinese or other non-English settings. Anyone running multilingual benchmarks would find the resource relevant if the labels hold up.

It deserves peer review. Referees can check the full annotation protocol and ask for the agreement numbers; a revised version with those metrics would be a usable addition to the field.

Referee Report

1 major / 1 minor

Summary. The paper introduces ChiSafe-PAS, a human-annotated benchmark of 1,897 adversarial Chinese prompts across four high-stakes domains (self-harm/violence, drug/illicit trade, fraud, satire), with 1,544 entries carrying complete gold-standard labels: 3-class response (REFUSE/SAFE-REDIRECT/RESPOND), 9-category obfuscation taxonomy, risk-level rating, and rationale. The work aims to address gaps in English-centric LLM safety evaluation by capturing Chinese-specific evasion techniques such as Pinyin romanization, character decomposition, internet slang, and hedging tone.

Significance. If the annotations are shown to be reliable, ChiSafe-PAS would supply a valuable, culturally grounded resource for benchmarking LLM safety alignment in Chinese, enabling targeted testing of evasion methods that defeat existing safeguards and supporting more representative evaluation in high-stakes domains.

major comments (1)

[Annotation Process] Annotation Process section (as described in the abstract and full text): the manuscript provides no inter-annotator agreement statistics, no information on the number of annotators per item, and no adjudication procedure. This is load-bearing for the central claim because the utility of the 1,544 gold-standard entries as reliable labels for culturally specific categories (e.g., the nine obfuscation types) cannot be assessed without quantitative validation of consistency.

minor comments (1)

[Abstract] Abstract: the token 'bound-aries' is a hyphenation artifact and should read 'boundaries'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful and constructive review. The single major comment identifies a genuine gap in the current manuscript regarding the annotation process details. We address it directly below and will revise accordingly.

Referee: [Annotation Process] Annotation Process section (as described in the abstract and full text): the manuscript provides no inter-annotator agreement statistics, no information on the number of annotators per item, and no adjudication procedure. This is load-bearing for the central claim because the utility of the 1,544 gold-standard entries as reliable labels for culturally specific categories (e.g., the nine obfuscation types) cannot be assessed without quantitative validation of consistency.

Authors: We agree with the referee that the manuscript does not report inter-annotator agreement statistics, the number of annotators per item, or the adjudication procedure. Although the text states that the annotation process is described in detail, these quantitative reliability metrics were not included. This information is necessary to substantiate the reliability of the 1,544 gold-standard labels, especially for the nine-category obfuscation taxonomy. In the revised manuscript we will add a dedicated subsection on the annotation process that reports: (1) the total number of annotators and the number assigned per item, (2) inter-annotator agreement metrics (e.g., pairwise agreement percentages and Cohen’s or Fleiss’ kappa), and (3) the adjudication protocol used to resolve disagreements. These additions will directly support the central claim of reliable, culturally grounded labels.

Revision: yes
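For reference, Cohen's kappa for two annotators can be computed as in the sketch below. It corrects the observed agreement p_o for the agreement p_e expected by chance from each annotator's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The function name and the example labels are illustrative and do not come from the dataset; Fleiss' kappa would be needed for more than two annotators per item.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Reporting kappa separately for the response label, the obfuscation category, and the risk level would show which of the three annotation layers is least stable.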

Circularity Check

0 steps flagged

No circularity: new dataset contribution with no derivations or self-referential reductions

The paper presents ChiSafe-PAS, a human-annotated benchmark of adversarial Chinese prompts with labels for response type, obfuscation taxonomy, risk level, and rationale. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. The central claim is the creation and annotation of this dataset itself; its value rests on the annotation process rather than any reduction of outputs to inputs by construction. No self-citation load-bearing steps, ansatzes, or renamings of known results are present. This is a standard dataset paper whose contribution is independent of the circularity patterns enumerated.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark-creation paper; it introduces no mathematical models, fitted parameters, or new theoretical entities.

