Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese
Pith reviewed 2026-06-29 07:53 UTC · model grok-4.3
The pith
A human-annotated benchmark of 1,897 Chinese adversarial prompts shows that English safety alignment does not transfer to Chinese-language settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a dedicated benchmark of adversarial Chinese prompts with human gold-standard labels for response type, obfuscation category, and risk level is required to evaluate LLM safety, because English-trained systems break down on language-specific evasion techniques including Pinyin romanization, character decomposition, internet slang, and hedging tone.
What carries the argument
The ChiSafe-PAS dataset, which supplies 1,897 prompts, of which 1,544 are fully annotated with a three-class response label, a nine-category obfuscation label, a risk-level rating, and an annotator rationale.
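A minimal sketch of how one such entry might be represented, assuming a flat record per prompt. The class and field names are hypothetical and are not taken from the paper. The three response labels are the ones named in the abstract. The obfuscation category and risk level are left as free strings because the source names only four of the nine obfuscation techniques and does not state the risk scale.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ResponseLabel(Enum):
    # The three response classes named in the abstract.
    REFUSE = "REFUSE"
    SAFE_REDIRECT = "SAFE-REDIRECT"
    RESPOND = "RESPOND"


@dataclass
class Entry:
    # Hypothetical record layout; field names are assumptions.
    prompt: str
    domain: str  # e.g. "fraud"; one of the four domains in the abstract
    label: Optional[ResponseLabel] = None
    obfuscation: Optional[str] = None  # one of nine categories (not all listed in the source)
    risk_level: Optional[str] = None   # scale not specified in the source
    rationale: Optional[str] = None

    def is_fully_annotated(self) -> bool:
        # True only when all four annotation fields are present.
        fields = (self.label, self.obfuscation, self.risk_level, self.rationale)
        return all(f is not None for f in fields)
```

Under this layout, the 1,544 gold-standard entries would be those for which `is_fully_annotated()` returns true, and the remaining prompts would carry partial or no annotations.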
If this is right
- Safety evaluations using this benchmark will report lower performance for Chinese inputs than for English inputs on the same risk categories.
- The nine-category obfuscation taxonomy supplies a structured way to measure and improve detection of Chinese-specific evasion methods.
- Models can be compared consistently on refusal, safe-redirect, and full-response behavior within the four domains of self-harm, illicit trade, fraud, and satire.
- Annotation rationales provide traceable evidence for why certain prompts should trigger refusal rather than response.
- The dataset supports direct measurement of the gap between English safety training and actual behavior in Chinese contexts.
Where Pith is reading between the lines
- The same annotation approach could be applied to additional languages that use non-Latin scripts or romanization to create parallel benchmarks.
- Training data drawn from the obfuscation categories might reduce the need for post-hoc filtering in deployed Chinese systems.
- The distinction between training and evaluation data becomes harder to maintain once such annotated adversarial sets exist.
- Risk-level ratings allow experiments that test whether higher-rated prompts produce measurably different model outputs across domains.
Load-bearing premise
The human annotators' labels accurately identify genuine Chinese evasion techniques and the correct safety responses for each prompt.
What would settle it
A controlled test in which multiple LLMs achieve the same refusal rates on these Chinese prompts as on matched English prompts would show that the benchmark does not isolate unique language failures.
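One way such a comparison could be scored is sketched below, assuming each model output has already been assigned one of the three response labels. The function names are illustrative, and the pooled two-proportion z statistic is a swapped-in simplification: it treats the Chinese and English prompt sets as independent samples, whereas truly matched prompts would call for a paired test. The paper itself reports no such statistic.

```python
import math


def refusal_rate(labels):
    """Fraction of responses labelled REFUSE."""
    return sum(1 for label in labels if label == "REFUSE") / len(labels)


def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic.

    k1 of n1 responses were refusals in the first set (e.g. English),
    k2 of n2 in the second set (e.g. Chinese).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both sets refused everything or nothing: no measurable gap.
        return 0.0
    return (p1 - p2) / se
```

A z value near zero across several models would be consistent with equal refusal rates in the two languages, which is the outcome that would undercut the benchmark's premise.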
Original abstract
When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural bound-aries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChiSafe-PAS, a human-annotated benchmark of 1,897 adversarial Chinese prompts across four high-stakes domains (self-harm/violence, drug/illicit trade, fraud, satire), with 1,544 entries carrying complete gold-standard labels: 3-class response (REFUSE/SAFE-REDIRECT/RESPOND), 9-category obfuscation taxonomy, risk-level rating, and rationale. The work aims to address gaps in English-centric LLM safety evaluation by capturing Chinese-specific evasion techniques such as Pinyin romanization, character decomposition, internet slang, and hedging tone.
Significance. If the annotations are shown to be reliable, ChiSafe-PAS would supply a valuable, culturally grounded resource for benchmarking LLM safety alignment in Chinese, enabling targeted testing of evasion methods that defeat existing safeguards and supporting more representative evaluation in high-stakes domains.
major comments (1)
- [Annotation Process] Annotation Process section (as described in the abstract and full text): the manuscript provides no inter-annotator agreement statistics, no information on the number of annotators per item, and no adjudication procedure. This is load-bearing for the central claim because the utility of the 1,544 gold-standard entries as reliable labels for culturally specific categories (e.g., the nine obfuscation types) cannot be assessed without quantitative validation of consistency.
minor comments (1)
- [Abstract] Abstract: the token 'bound-aries' is a hyphenation artifact and should read 'boundaries'.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. The single major comment identifies a genuine gap in the current manuscript regarding the annotation process details. We address it directly below and will revise accordingly.
Point-by-point responses
- Referee: [Annotation Process] Annotation Process section (as described in the abstract and full text): the manuscript provides no inter-annotator agreement statistics, no information on the number of annotators per item, and no adjudication procedure. This is load-bearing for the central claim because the utility of the 1,544 gold-standard entries as reliable labels for culturally specific categories (e.g., the nine obfuscation types) cannot be assessed without quantitative validation of consistency.
  Authors: We agree with the referee that the manuscript does not report inter-annotator agreement statistics, the number of annotators per item, or the adjudication procedure. Although the text states that the annotation process is described in detail, these quantitative reliability metrics were not included. This information is necessary to substantiate the reliability of the 1,544 gold-standard labels, especially for the nine-category obfuscation taxonomy. In the revised manuscript we will add a dedicated subsection on the annotation process that reports: (1) the total number of annotators and the number assigned per item, (2) inter-annotator agreement metrics (e.g., pairwise agreement percentages and Cohen’s or Fleiss’ kappa), and (3) the adjudication protocol used to resolve disagreements. These additions will directly support the central claim of reliable, culturally grounded labels.
  Revision: yes
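For readers unfamiliar with the agreement metric mentioned in the response, the following is a minimal sketch of Cohen's kappa for two annotators who labelled the same items. It assumes exactly two annotators and nominal labels. The function name is illustrative, and nothing here reflects how the authors actually computed or will compute agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters for a taxonomy with skewed category frequencies. With more than two annotators per item, Fleiss' kappa would be the analogous choice.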
Circularity Check
No circularity: new dataset contribution with no derivations or self-referential reductions
Full rationale
The paper presents ChiSafe-PAS, a human-annotated benchmark of adversarial Chinese prompts with labels for response type, obfuscation taxonomy, risk level, and rationale. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. The central claim is the creation and annotation of this dataset itself; its value rests on the annotation process rather than any reduction of outputs to inputs by construction. No self-citation load-bearing steps, ansatzes, or renamings of known results are present. This is a standard dataset paper whose contribution is independent of the circularity patterns enumerated.