pith. sign in

arxiv: 2605.22258 · v1 · pith:AZ2NQUKUnew · submitted 2026-05-21 · 💻 cs.CL

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

Pith reviewed 2026-05-22 05:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords Chinese toxicityimplicit attacksLLM safetyred teamingobfuscationcontent moderationadversarial examplesdefense training
0
0 comments X

The pith

A three-stage rewriting process generates Chinese toxicity samples that evade detectors while preserving harmful intent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CITA as a controlled way to create red-team evaluation samples and defense training data for Chinese toxicity in large language models. The method learns the original harmful intent, then rewrites the text to be more implicit and adds obfuscated surface variants through three sequential stages. When these samples are fed to seven existing detectors, the average attack success rate reaches 69.48 percent, showing many toxic cases go undetected. Human reviewers confirm the rewritten versions remain harmful yet appear more indirect and evasive than the starting prompts. The authors also fine-tune a defense model called CITD on the generated data and report improved robustness to implicit toxicity.

Core claim

The authors establish that their CITA method, through harmful intent learning, implicit toxicity enhancement, and obfuscation variant rewriting, produces evaluation samples where detectors exhibit substantial missed-detection risks with an average ASR of 69.48 percent. Human evaluation confirms that the generated texts preserve harmfulness while increasing implicitness and evasiveness. As a practical application, fine-tuning the CITD model with CITA-generated data improves robustness against implicit toxicity.

What carries the argument

The CITA framework's three-stage pipeline that preserves harmful intent, boosts implicitness, and introduces controlled surface variants to create harder-to-detect toxicity examples.

If this is right

  • Existing Chinese toxicity detectors show high vulnerability to these implicitly enhanced and obfuscated samples.
  • Human raters rate the rewritten samples as comparably harmful but more implicit and evasive than originals.
  • Fine-tuning defense models on CITA data leads to better performance against implicit toxicity attacks.
  • The approach provides a way to generate red-team data for improving safety in Chinese LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current detection methods may over-rely on explicit keywords and surface patterns rather than semantic intent.
  • Similar generation techniques could be adapted for other languages to test and strengthen toxicity filters.
  • Regularly incorporating such adversarial data into training loops might become necessary for maintaining effective content moderation.

Load-bearing premise

The rewriting stages can increase implicitness and obfuscation while fully preserving the original harmful intent and without creating artificial patterns that make the attack success rates appear higher than they would be in real use.

What would settle it

A detector that maintains high detection rates on the CITA-generated samples comparable to explicit ones, or human evaluators who rate the samples as less harmful or not more implicit than the originals, would contradict the main findings.

Figures

Figures reproduced from arXiv: 2605.22258 by Bo Xu, Hongbo Wang, Hongfei Lin, Jingyi Kang, Junyu Lu, Linlin Zong, Roy Ka-Wei Lee.

Figure 1
Figure 1. Figure 1: Illustration of Chinese explicit and implicit [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the controlled CITA red-team framework for Chinese implicit toxicity evaluation and defense-data generation, including Harmful Intent Learning, Implicit Toxicity Enhancement, and Obfuscation Variant Rewriting. The model first learns to generate harmful responses in natural contexts, then increases semantic indirectness through reward-guided optimization, and finally applies multiple obfuscation… view at source ↗
Figure 3
Figure 3. Figure 3: Human evaluation of generated Chinese toxic [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Abridged prompt template used for red-team response generation. [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Prompt used by the LLM judge to compute the Implicit-Toxicity Quality Reward in the Implicit Toxicity [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt used for blue-model defense inference. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Abridged prompt templates used for obfuscation-variant rewriting. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: ASR comparison across Qwen3 detector scales. Public denotes the averaged ASR over the five public toxicity datasets reported in [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Case study of gender-related implicit bias and its obfuscation variants. The [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
read the original abstract

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Chinese Implicit Toxicity Attack (CITA), a three-stage red-teaming framework (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) for generating implicit and obfuscated Chinese toxicity samples. It reports that these samples achieve an average attack success rate (ASR) of 69.48% against seven toxicity detectors, with human evaluation confirming preserved harmfulness alongside increased implicitness and evasiveness. As a downstream application, the authors fine-tune a Chinese Implicit Toxicity Defense (CITD) model on CITA-generated data and claim improved robustness.

Significance. If the three-stage process reliably preserves original harmful intent while controllably increasing implicitness and surface variants, the work would provide a useful controlled method for red-teaming Chinese toxicity detectors and generating defense training data. This addresses a genuine gap in non-explicit toxicity evaluation for Chinese, where semantic indirectness and obfuscation are common. The downstream CITD fine-tuning result, if reproducible, would demonstrate practical utility of the generated data.

major comments (2)
  1. [Method / §3 (three-stage process)] The central claim of 69.48% average ASR and preserved harmfulness rests on the three-stage process (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) keeping original intent fixed. However, the manuscript provides no quantitative intent-preservation metrics (e.g., embedding cosine similarity, NLI entailment scores, or separate human intent ratings) or ablations comparing input-output semantic fidelity. Without these, it is unclear whether elevated ASR and human ratings reflect the intended mechanism or uncontrolled semantic shifts introduced by the LLM rewriting steps.
  2. [Abstract and §4 (evaluation)] The abstract and evaluation sections report concrete ASR numbers and human confirmation but omit details on sample generation mechanics (prompt templates, temperature settings, number of variants per seed), baseline detector implementations, statistical significance tests for the 69.48% figure, or exclusion criteria for the evaluation set. These omissions make it impossible to assess whether post-hoc choices or unstated assumptions inflate the reported missed-detection risks.
minor comments (2)
  1. [§4] Clarify the exact number of CITA-generated samples used for the seven-detector evaluation and for CITD fine-tuning; state whether the same seeds were used across stages or whether new harmful intents were introduced.
  2. [Introduction] The claim that CITA is 'not a deployable evasion tool' should be supported by explicit discussion of safeguards or limitations on release of the generated data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects for improving the clarity and rigor of our work on Chinese Implicit Toxicity Attack (CITA). We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Method / §3 (three-stage process)] The central claim of 69.48% average ASR and preserved harmfulness rests on the three-stage process (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) keeping original intent fixed. However, the manuscript provides no quantitative intent-preservation metrics (e.g., embedding cosine similarity, NLI entailment scores, or separate human intent ratings) or ablations comparing input-output semantic fidelity. Without these, it is unclear whether elevated ASR and human ratings reflect the intended mechanism or uncontrolled semantic shifts introduced by the LLM rewriting steps.

    Authors: We agree that quantitative evidence for intent preservation would provide stronger support for the mechanism. While the current human evaluation includes explicit ratings on harmfulness preservation (confirming that generated samples retain the original harmful intent), we acknowledge the value of additional automated metrics. In the revised manuscript, we will add embedding cosine similarity scores between seed inputs and final outputs using a Chinese sentence embedding model, along with NLI entailment scores to measure semantic fidelity. We will also include an ablation study that isolates the effect of each stage on ASR while reporting intent preservation metrics, to demonstrate that the observed attack success arises from increased implicitness and obfuscation rather than unintended semantic drift. revision: yes

  2. Referee: [Abstract and §4 (evaluation)] The abstract and evaluation sections report concrete ASR numbers and human confirmation but omit details on sample generation mechanics (prompt templates, temperature settings, number of variants per seed), baseline detector implementations, statistical significance tests for the 69.48% figure, or exclusion criteria for the evaluation set. These omissions make it impossible to assess whether post-hoc choices or unstated assumptions inflate the reported missed-detection risks.

    Authors: We recognize that these implementation details are essential for reproducibility and for allowing readers to evaluate potential biases. In the revised manuscript, we will expand the experimental setup section to include: (i) the exact prompt templates used in each of the three stages, (ii) generation hyperparameters such as temperature (set to 0.7) and the number of variants generated per seed (five variants), (iii) precise descriptions of the seven baseline detectors including model versions, fine-tuning details, and decision thresholds, (iv) statistical significance testing (e.g., bootstrap confidence intervals and paired tests) for the reported 69.48% average ASR, and (v) explicit exclusion criteria for the evaluation set, such as filtering samples that fail preliminary human checks for intent preservation. These additions will be placed in a new subsection under §4 to improve transparency without altering the core results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical generation and evaluation framework

full rationale

The paper introduces an empirical three-stage LLM-based generation process (Harmful Intent Learning, Implicit Toxicity Enhancement, Obfuscation Variant Rewriting) to produce Chinese implicit toxicity samples for red-teaming. It then measures attack success rates (average ASR 69.48%) on seven existing detectors and conducts human evaluations for preserved harmfulness and increased implicitness. A downstream fine-tuning of CITD on the generated data is presented as an application showing robustness gains. No equations, fitted parameters, mathematical derivations, or load-bearing self-citations appear that would reduce any claimed result to a tautology or input by construction. All reported quantities are direct experimental measurements from the described pipeline and are independently replicable or falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be extracted or audited from the provided text.

pith-pipeline@v0.9.0 · 5719 in / 1312 out tokens · 50011 ms · 2026-05-22T05:52:53.365769+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 6 internal anchors

  1. [1]

    Findings of the Association for Computational Linguistics: EMNLP 2020 , year =

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models , author =. Findings of the Association for Computational Linguistics: EMNLP 2020 , year =. doi:10.18653/v1/2020.findings-emnlp.300 , pages =

  2. [2]

    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year =

    Red Teaming Language Models with Language Models , author =. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year =. doi:10.18653/v1/2022.emnlp-main.225 , url =

  3. [3]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned , author =. arXiv preprint arXiv:2209.07858 , year =

  4. [4]

    International Conference on Machine Learning , year =

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , author =. International Conference on Machine Learning , year =

  5. [5]

    doi: 10.18653/v1/2023.emnlp-main.84

    Unveiling the Implicit Toxicity in Large Language Models , author =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , year =. doi:10.18653/v1/2023.emnlp-main.84 , url =

  6. [6]

    Implicitly Abusive Language

    Wiegand, Michael and Ruppenhofer, Josef and Eder, Elisabeth , booktitle =. Implicitly Abusive Language. 2021 , address =. doi:10.18653/v1/2021.naacl-main.48 , pages =

  7. [7]

    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , month = nov, year =

    Latent Hatred: A Benchmark for Understanding Implicit Hate Speech , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , month = nov, year =. doi:10.18653/v1/2021.emnlp-main.29 , pages =

  8. [8]

    Smith, and Yejin Choi

    Social Bias Frames: Reasoning about Social and Power Implications of Language , author =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year =. doi:10.18653/v1/2020.acl-main.486 , url =

  9. [9]

    Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection , author =. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , year =. doi:10.18653/v1/2021.acl-long.132 , url =

  10. [10]

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection , author =. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =. doi:10.18653/v1/2022.acl-long.234 , url =

  11. [11]

    Findings of the Association for Computational Linguistics: EMNLP 2022 , year =

    Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark , author =. Findings of the Association for Computational Linguistics: EMNLP 2022 , year =. doi:10.18653/v1/2022.findings-emnlp.262 , url =

  12. [12]

    2022 , address =

    Deng, Jiawen and Zhou, Jingyan and Sun, Hao and Zheng, Chujie and Mi, Fei and Meng, Helen and Huang, Minlie , booktitle =. 2022 , address =. doi:10.18653/v1/2022.emnlp-main.796 , pages =

  13. [13]

    Online Social Networks and Media , year =

    SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection , author =. Online Social Networks and Media , year =

  14. [14]

    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =. doi:10.18653/v1/2023.acl-long.898 , url =

  15. [15]

    Proceedings of the 31st International Conference on Computational Linguistics , year =

    SCCD: A Session-based Dataset for Chinese Cyberbullying Detection , author =. Proceedings of the 31st International Conference on Computational Linguistics , year =

  16. [16]

    2024 , address =

    Wang, Hongbo and Li, Mingda and Lu, Junyu and Xia, Hebin and Yang, Liang and Xu, Bo and Liu, Ruizhu and Lin, Hongfei , booktitle =. 2024 , address =. doi:10.18653/v1/2024.findings-emnlp.406 , pages =

  17. [17]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , year =

    ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , year =. doi:10.18653/v1/2024.emnlp-main.345 , url =

  18. [18]

    Findings of the Association for Computational Linguistics: ACL 2025 , year =

    Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings , author =. Findings of the Association for Computational Linguistics: ACL 2025 , year =. doi:10.18653/v1/2025.findings-acl.742 , url =

  19. [19]

    arXiv preprint arXiv:2505.22184 , year =

    Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon , author =. arXiv preprint arXiv:2505.22184 , year =

  20. [20]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =

    Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =. doi:10.18653/v1/2025.emnlp-industry.172 , url =

  21. [21]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. arXiv preprint arXiv:2402.03300 , year =

  22. [22]

    Qwen3 Technical Report

    Qwen3 Technical Report , author =. arXiv preprint arXiv:2505.09388 , year =

  23. [23]

    arXiv preprint arXiv:2304.10436 , year =

    Safety Assessment of Chinese Large Language Models , author =. arXiv preprint arXiv:2304.10436 , year =

  24. [24]

    arXiv preprint arXiv:2307.09705 , year =

    CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility , author =. arXiv preprint arXiv:2307.09705 , year =

  25. [25]

    2023 , publisher =

    Lin, Zi and Wang, Zihan and Tong, Yongqi and Wang, Yangkun and Guo, Yuxin and Wang, Yujia and Shang, Jingbo , booktitle =. 2023 , publisher =. doi:10.18653/v1/2023.findings-emnlp.311 , url =

  26. [26]

    2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , year =

    Jailbreaking Black Box Large Language Models in Twenty Queries , author =. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , year =. doi:10.1109/SaTML64287.2025.00010 , url =

  27. [27]

    Advances in Neural Information Processing Systems , year =

    Jailbroken: How Does LLM Safety Training Fail? , author =. Advances in Neural Information Processing Systems , year =

  28. [28]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Universal and Transferable Adversarial Attacks on Aligned Language Models , author =. arXiv preprint arXiv:2307.15043 , year =

  29. [29]

    Advances in Neural Information Processing Systems , volume =

    Training language models to follow instructions with human feedback , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  30. [30]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Training a helpful and harmless assistant with reinforcement learning from human feedback , author =. arXiv preprint arXiv:2204.05862 , year =

  31. [31]

    Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security , year =

    Do-Anything-Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models , author =. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security , year =

  32. [32]

    The Twelfth International Conference on Learning Representations , year =

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author =. The Twelfth International Conference on Learning Representations , year =

  33. [33]

    Transactions on Machine Learning Research , year =

    Explore, Establish, Exploit: Red Teaming Language Models from Scratch , author =. Transactions on Machine Learning Research , year =

  34. [34]

    The Twelfth International Conference on Learning Representations , year =

    Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation , author =. The Twelfth International Conference on Learning Representations , year =

  35. [35]

    Advances in Neural Information Processing Systems , year =

    DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models , author =. Advances in Neural Information Processing Systems , year =

  36. [36]

    Mitigating harm in language models with conditional-likelihood filtration

    Mitigating harm in language models with conditional-likelihood filtration , author =. arXiv preprint arXiv:2108.07790 , year =

  37. [37]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , journal =. 2025 , url =. doi:10.48550/ARXIV.2501.12948 , eprinttype =. 2501.12948 , timestamp =

  38. [38]

    arXiv preprint arXiv:2403.18314 , year =

    Chinese Offensive Language Detection: Current Status and Future Directions , author =. arXiv preprint arXiv:2403.18314 , year =

  39. [39]

    2025 , address =

    Bai, Zewen and Yang, Liang and Yin, Shengdi and Lu, Junyu and Zeng, Jingjie and Zhu, Haohao and Sun, Yuanyuan and Lin, Hongfei , booktitle =. 2025 , address =. doi:10.18653/v1/2025.findings-acl.532 , url =

  40. [40]

    Proceedings of the 2nd Workshop on Abusive Language Online , year =

    Decipherment for Adversarial Offensive Language Detection , author =. Proceedings of the 2nd Workshop on Abusive Language Online , year =