Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
Pith reviewed 2026-05-22 05:52 UTC · model grok-4.3
The pith
A three-stage rewriting process generates Chinese toxicity samples that evade detectors while preserving harmful intent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that their CITA method, through harmful intent learning, implicit toxicity enhancement, and obfuscation variant rewriting, produces evaluation samples on which the seven tested detectors show substantial missed-detection risk, with an average attack success rate (ASR) of 69.48%. Human evaluation confirms that the generated texts preserve harmfulness while increasing implicitness and evasiveness. As a practical application, fine-tuning the CITD model with CITA-generated data improves robustness against implicit toxicity.
What carries the argument
The CITA framework's three-stage pipeline that preserves harmful intent, boosts implicitness, and introduces controlled surface variants to create harder-to-detect toxicity examples.
If this is right
- Existing Chinese toxicity detectors show high vulnerability to these implicitly enhanced and obfuscated samples.
- Human raters judge the rewritten samples to be comparably harmful but more implicit and evasive than the originals.
- Fine-tuning defense models on CITA data leads to better performance against implicit toxicity attacks.
- The approach provides a way to generate red-team data for improving safety in Chinese LLMs.
Where Pith is reading between the lines
- Current detection methods may over-rely on explicit keywords and surface patterns rather than semantic intent.
- Similar generation techniques could be adapted for other languages to test and strengthen toxicity filters.
- Regularly incorporating such adversarial data into training loops might become necessary for maintaining effective content moderation.
Load-bearing premise
The rewriting stages can increase implicitness and obfuscation while fully preserving the original harmful intent and without creating artificial patterns that make the attack success rates appear higher than they would be in real use.
What would settle it
A detector whose detection rate on CITA-generated samples stays comparable to its rate on explicit ones, or human evaluators who rate the samples as less harmful or no more implicit than the originals, would contradict the main findings.
Original abstract
Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.
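For readers unfamiliar with the metric, here is a minimal sketch of how an average ASR could be computed. It assumes that ASR means the share of toxic samples a detector fails to flag and that the average weights every detector equally; the abstract does not spell out either choice. The detector names and predictions are hypothetical and are not data from the paper.

```python
from statistics import mean


def attack_success_rate(flags):
    """Fraction of toxic samples a detector failed to flag.

    flags: list of booleans, True when the detector flagged the sample.
    Assumption: every sample in the list is toxic, so each unflagged
    sample counts as a successful attack (a missed detection).
    """
    if not flags:
        raise ValueError("need at least one sample")
    missed = sum(1 for flagged in flags if not flagged)
    return missed / len(flags)


# Hypothetical predictions from three detectors on the same four toxic samples.
predictions = {
    "detector_a": [True, False, False, False],
    "detector_b": [False, False, True, True],
    "detector_c": [False, False, False, False],
}

# Per-detector ASR, then an unweighted (macro) average across detectors.
per_detector = {name: attack_success_rate(flags) for name, flags in predictions.items()}
average_asr = mean(per_detector.values())
```

A macro average treats each detector as equally important. Pooling all predictions into one rate would give the same number here only because every detector sees the same number of samples.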
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Chinese Implicit Toxicity Attack (CITA), a three-stage red-teaming framework (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) for generating implicit and obfuscated Chinese toxicity samples. It reports that these samples achieve an average attack success rate (ASR) of 69.48% against seven toxicity detectors, with human evaluation confirming preserved harmfulness alongside increased implicitness and evasiveness. As a downstream application, the authors fine-tune a Chinese Implicit Toxicity Defense (CITD) model on CITA-generated data and claim improved robustness.
Significance. If the three-stage process reliably preserves original harmful intent while controllably increasing implicitness and surface variants, the work would provide a useful controlled method for red-teaming Chinese toxicity detectors and generating defense training data. This addresses a genuine gap in non-explicit toxicity evaluation for Chinese, where semantic indirectness and obfuscation are common. The downstream CITD fine-tuning result, if reproducible, would demonstrate practical utility of the generated data.
major comments (2)
- [Method / §3 (three-stage process)] The central claim of 69.48% average ASR and preserved harmfulness rests on the three-stage process (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) keeping original intent fixed. However, the manuscript provides no quantitative intent-preservation metrics (e.g., embedding cosine similarity, NLI entailment scores, or separate human intent ratings) or ablations comparing input-output semantic fidelity. Without these, it is unclear whether elevated ASR and human ratings reflect the intended mechanism or uncontrolled semantic shifts introduced by the LLM rewriting steps.
- [Abstract and §4 (evaluation)] The abstract and evaluation sections report concrete ASR numbers and human confirmation but omit details on sample generation mechanics (prompt templates, temperature settings, number of variants per seed), baseline detector implementations, statistical significance tests for the 69.48% figure, or exclusion criteria for the evaluation set. These omissions make it impossible to assess whether post-hoc choices or unstated assumptions inflate the reported missed-detection risks.
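The first major comment asks for automated intent-preservation checks such as embedding cosine similarity. A minimal sketch of that check follows, assuming that sentence embeddings for each seed and its rewrite are already available as plain lists of floats. The toy vectors, the function names, and the 0.8 threshold are illustrative assumptions and do not come from the paper. Cosine similarity measures general semantic closeness only, so it would complement rather than replace human intent ratings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def flag_semantic_drift(pairs, threshold):
    """Indices of (seed, rewrite) embedding pairs whose similarity is below threshold."""
    return [
        i
        for i, (seed, rewrite) in enumerate(pairs)
        if cosine_similarity(seed, rewrite) < threshold
    ]


# Toy 3-dimensional vectors standing in for real sentence embeddings.
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),  # rewrite stays close to the seed
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # rewrite is orthogonal to the seed
]

# The threshold is an arbitrary illustration, not a recommended value.
drifted = flag_semantic_drift(pairs, threshold=0.8)
```

Pairs returned by `flag_semantic_drift` would be candidates for manual review, since a low score suggests the rewrite may no longer carry the seed's meaning.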
minor comments (2)
- [§4] Clarify the exact number of CITA-generated samples used for the seven-detector evaluation and for CITD fine-tuning; state whether the same seeds were used across stages or whether new harmful intents were introduced.
- [Introduction] The claim that CITA is 'not a deployable evasion tool' should be supported by explicit discussion of safeguards or limitations on release of the generated data.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important aspects for improving the clarity and rigor of our work on Chinese Implicit Toxicity Attack (CITA). We address each major comment below and outline the revisions we will make to strengthen the manuscript.
- Referee: [Method / §3 (three-stage process)] The central claim of 69.48% average ASR and preserved harmfulness rests on the three-stage process (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) keeping original intent fixed. However, the manuscript provides no quantitative intent-preservation metrics (e.g., embedding cosine similarity, NLI entailment scores, or separate human intent ratings) or ablations comparing input-output semantic fidelity. Without these, it is unclear whether elevated ASR and human ratings reflect the intended mechanism or uncontrolled semantic shifts introduced by the LLM rewriting steps.
  Authors: We agree that quantitative evidence for intent preservation would provide stronger support for the mechanism. While the current human evaluation includes explicit ratings on harmfulness preservation (confirming that generated samples retain the original harmful intent), we acknowledge the value of additional automated metrics. In the revised manuscript, we will add embedding cosine similarity scores between seed inputs and final outputs using a Chinese sentence embedding model, along with NLI entailment scores to measure semantic fidelity. We will also include an ablation study that isolates the effect of each stage on ASR while reporting intent preservation metrics, to demonstrate that the observed attack success arises from increased implicitness and obfuscation rather than unintended semantic drift.
  Revision: yes
- Referee: [Abstract and §4 (evaluation)] The abstract and evaluation sections report concrete ASR numbers and human confirmation but omit details on sample generation mechanics (prompt templates, temperature settings, number of variants per seed), baseline detector implementations, statistical significance tests for the 69.48% figure, or exclusion criteria for the evaluation set. These omissions make it impossible to assess whether post-hoc choices or unstated assumptions inflate the reported missed-detection risks.
  Authors: We recognize that these implementation details are essential for reproducibility and for allowing readers to evaluate potential biases. In the revised manuscript, we will expand the experimental setup section to include: (i) the exact prompt templates used in each of the three stages, (ii) generation hyperparameters such as temperature (set to 0.7) and the number of variants generated per seed (five variants), (iii) precise descriptions of the seven baseline detectors including model versions, fine-tuning details, and decision thresholds, (iv) statistical significance testing (e.g., bootstrap confidence intervals and paired tests) for the reported 69.48% average ASR, and (v) explicit exclusion criteria for the evaluation set, such as filtering samples that fail preliminary human checks for intent preservation. These additions will be placed in a new subsection under §4 to improve transparency without altering the core results.
  Revision: yes
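The rebuttal mentions bootstrap confidence intervals for the average ASR. A minimal sketch of a percentile bootstrap follows, assuming each evaluation sample is recorded as 1 when the detector missed it and 0 when it was caught. The outcome list, the resample count, and the fixed seed are hypothetical choices for illustration; none of the numbers are taken from the paper.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes.

    outcomes: list where 1 means the detector missed the sample.
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes: 70 missed detections out of 100 samples.
outcomes = [1] * 70 + [0] * 30
point_estimate = sum(outcomes) / len(outcomes)
lower, upper = bootstrap_ci(outcomes)
```

Reporting the interval next to the point estimate shows how much an ASR figure could move under resampling of the evaluation set. Because several variants may share one seed, resampling whole seeds rather than individual samples may be the more appropriate design.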
Circularity Check
No significant circularity in empirical generation and evaluation framework
The paper introduces an empirical three-stage LLM-based generation process (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) to produce Chinese implicit toxicity samples for red-teaming. It then measures attack success rates (average ASR 69.48%) on seven existing detectors and conducts human evaluations for preserved harmfulness and increased implicitness. A downstream fine-tuning of CITD on the generated data is presented as an application showing robustness gains. No equations, fitted parameters, mathematical derivations, or load-bearing self-citations appear that would reduce any claimed result to a tautology or input by construction. All reported quantities are direct experimental measurements from the described pipeline and are independently replicable or falsifiable.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.