pith. sign in

arxiv: 2606.09700 · v2 · pith:B6GRTJHVnew · submitted 2026-06-08 · 💻 cs.CR · cs.HC· cs.LG

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Pith reviewed 2026-06-27 16:16 UTC · model grok-4.3

classification 💻 cs.CR cs.HCcs.LG
keywords adversarial attackscontent moderationLLM vulnerabilitiestypographic manipulationhuman perceptionblack-box attacksevasion techniques
0
0 comments X

The pith

Typographic manipulations let harmful text evade LLM moderators while staying obvious to humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Human-Perceptible Adversarial Attacks that embed harmful expressions into benign text through spacing, emphasis, and spatial arrangement. These attacks are generated automatically in a black-box setting using only three queries to the target detector. Across thirteen moderation systems the resulting text achieves over 86 percent human recognition while keeping automated detection rates below 1 percent. A sympathetic reader would care because the work shows that current token-based moderation misses the visual cues people naturally use when judging content.

Core claim

Human-Perceptible Adversarial Attacks embed harmful expressions into otherwise benign text using visually salient typographic manipulations. The method strategically combines spacing, emphasis, and spatial arrangement to preserve human recognition of the harmful intent while reducing machine detectability. Operating in a black-box setting with a small query budget, the attack automatically generates evasive content that succeeds against commercial APIs and state-of-the-art open-source guardrails.

What carries the argument

Human-Perceptible Adversarial Attacks (HPAA), a generation process that applies typographic features such as spacing, emphasis, and spatial arrangement to embed harmful content while preserving human readability.

If this is right

  • Moderation architectures that process only tokenized text remain vulnerable to visual signals humans use.
  • Specific typographic factors such as spacing and emphasis can be isolated as primary drivers of evasion success.
  • Practical defenses must incorporate signals that align moderation output with human perceptual understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Moderation pipelines could preprocess input by rendering text visually before tokenization to close the gap.
  • The same visual-manipulation principle might apply to other modalities where human and model perception diverge.
  • Attack success on commercial APIs suggests the vulnerability is widespread rather than limited to open models.

Load-bearing premise

The typographic manipulations preserve harmful semantic intent for human readers without the visual changes themselves altering or obscuring the perceived meaning.

What would settle it

A controlled test in which human participants read the manipulated text and fail to identify the embedded harmful intent at rates comparable to the claimed 86 percent would falsify the central effectiveness claim.

Figures

Figures reproduced from arXiv: 2606.09700 by Doowon Kim, Joshua Lee, Lu Malloy, Meisam Mohammady, Qin Yang, Xiaohan Chang, Yuan Hong.

Figure 1
Figure 1. Figure 1: Overview of typographic Human-perceptible Adversarial Attacks (HPAA) against modern content moderation pipelines. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Typographic configuration space in HPAA and an illustrative example (derived from real-world toxic content). [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The HPAA framework incorporates a two-round user study (in Stage 1 Human-Guided Refinement), conducted once to collect human feedback that informs and supports the attack design. reliably preserve human recognition of embedded toxic con￾tent. Starting from the full typographic configuration space, we conduct a two-round user study to evaluate how different combinations of granularity, placement, and stylis… view at source ↗
Figure 4
Figure 4. Figure 4: Human exposure rate and detectors’ detection rate on HUS-II dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The evasion results on STTD (red shadow) and HED datasets (3 iterations). [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Effect of benign word ratio on HPAA samples. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Recognition rates across transformation rule conditions. The gray dashed line separates individual rule conditions from [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Question and HPAA examples in user study. [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: An image count selected and appeared perfor [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
read the original abstract

Large language model (LLM)-powered content moderation systems are a critical defense against harmful online content. However, they operate primarily on tokenized text and often overlook visual cues that humans naturally use when interpreting content. We show that this limitation creates a fundamental vulnerability: content readily recognized as harmful by humans can evade automated moderation. To systematically study this problem, we introduce Human-Perceptible Adversarial Attacks (HPAA), which embed harmful expressions into otherwise benign text using visually salient typographic manipulations. HPAA strategically combines features such as spacing, emphasis, and spatial arrangement to preserve human recognition while reducing machine detectability. Operating in a black-box setting with a small query budget, the attack automatically generates evasive content without model access or gradient information. We evaluate HPAA on multiple datasets and thirteen widely deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. With only three detector queries, generated attacks achieve over 86\% human recognition while keeping detection rates below 1\% across evaluated systems. We further identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings reveal a fundamental blind spot in current LLM-based moderation systems and motivate moderation approaches that better align with human perceptual understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Human-Perceptible Adversarial Attacks (HPAA), which embed harmful expressions into benign-looking text via typographic manipulations (spacing, emphasis, spatial arrangement). It claims that in a black-box setting with a budget of only three detector queries, these attacks achieve over 86% human recognition as harmful while evading detection at rates below 1% on thirteen commercial and open-source moderation systems across multiple datasets. The work analyzes typographic factors, explains detector failures, and discusses defenses.

Significance. If the core empirical claims hold after addressing verification gaps, the result would demonstrate a concrete, low-cost vulnerability in deployed LLM moderation pipelines that rely on tokenized text rather than visual/perceptual cues. The black-box, low-query-budget attack and the systematic evaluation against real commercial APIs constitute a practical contribution that could drive development of perceptually aligned detectors.

major comments (3)
  1. [Section 4] Section 4 (Evaluation Methodology) and the human-study protocol: the central claim that HPAA preserves harmful semantic intent (required for the >86% recognition rate to support the HPAA definition) is not supported by any reported controls, such as side-by-side semantic-equivalence ratings, meaning-preservation scores, or comparison against the original harmful text; without these, the recognition rate may reflect altered or more salient interpretations rather than faithful preservation.
  2. [Section 5] Section 5 (Experimental Results) and Table 2 (or equivalent results table): detection rates below 1% are stated without error bars, confidence intervals, per-system sample sizes, or exclusion criteria, and the abstract's claim of evaluation on thirteen systems provides no dataset statistics or statistical tests; this directly affects the reliability of the cross-system evasion claim.
  3. [Section 3] Section 3 (HPAA Definition and Attack Generation): the definition of HPAA requires that typographic changes do not introduce ambiguity or shift perceived intent, yet no formal metric, ablation, or human validation step is described to enforce or measure this property, making the distinction from simple obfuscation load-bearing for the paper's thesis.
minor comments (2)
  1. [Abstract] The abstract states concrete success rates but omits any mention of the specific datasets used; adding this detail would improve reproducibility.
  2. Figure captions describing example HPAA instances should explicitly note the original harmful text for direct visual comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's careful reading and address each major comment below. Where the comments identify gaps in evidence or presentation, we will revise the manuscript to incorporate additional controls, statistics, and validation steps.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Evaluation Methodology) and the human-study protocol: the central claim that HPAA preserves harmful semantic intent (required for the >86% recognition rate to support the HPAA definition) is not supported by any reported controls, such as side-by-side semantic-equivalence ratings, meaning-preservation scores, or comparison against the original harmful text; without these, the recognition rate may reflect altered or more salient interpretations rather than faithful preservation.

    Authors: We agree that explicit controls for semantic equivalence would strengthen the evidence that high human recognition rates reflect preserved harmful intent. In the revised manuscript, we will add a supplementary human study in which participants provide side-by-side semantic-equivalence ratings and meaning-preservation scores (on a 5-point Likert scale) between original harmful text and the corresponding HPAA instances. These results will be reported alongside the existing recognition rates to address the concern directly. revision: yes

  2. Referee: [Section 5] Section 5 (Experimental Results) and Table 2 (or equivalent results table): detection rates below 1% are stated without error bars, confidence intervals, per-system sample sizes, or exclusion criteria, and the abstract's claim of evaluation on thirteen systems provides no dataset statistics or statistical tests; this directly affects the reliability of the cross-system evasion claim.

    Authors: We acknowledge that the current presentation of results lacks the statistical detail needed for full reliability assessment. In the revision, we will update Table 2 (and associated text) to include error bars, 95% confidence intervals, per-system sample sizes, explicit exclusion criteria, dataset statistics (e.g., number of instances per dataset), and appropriate statistical tests (such as binomial proportion confidence intervals or paired significance tests) supporting the cross-system evasion rates. revision: yes

  3. Referee: [Section 3] Section 3 (HPAA Definition and Attack Generation): the definition of HPAA requires that typographic changes do not introduce ambiguity or shift perceived intent, yet no formal metric, ablation, or human validation step is described to enforce or measure this property, making the distinction from simple obfuscation load-bearing for the paper's thesis.

    Authors: The current definition is supported empirically by the human recognition results, which serve as an operational validation that intent is preserved for the evaluated instances. Nevertheless, we agree that an explicit formal metric and validation step would make the distinction from obfuscation more rigorous. We will add an ablation analysis of typographic factors together with a targeted human validation protocol (measuring perceived intent shift) to Section 3 in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces HPAA as a conceptual attack method using typographic manipulations and evaluates it empirically via black-box queries against external commercial APIs and open-source systems, reporting human recognition rates (>86%) and detection rates (<1%). No equations, fitted parameters, or self-citations appear in the provided text that reduce any claim to an internal definition or input by construction. The central results depend on external benchmarks and human judgments rather than any self-referential derivation chain, satisfying the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that current moderation pipelines operate exclusively on tokenized text and ignore visual layout information; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption LLM-based moderation systems process input solely as tokenized text and do not incorporate visual rendering or layout cues
    Stated in the abstract as the core limitation that HPAA exploits.
invented entities (1)
  • Human-Perceptible Adversarial Attacks (HPAA) no independent evidence
    purpose: A class of attacks that combine typographic features to evade token-based detection while remaining legible to humans
    Newly defined attack category; no independent evidence outside the paper's own experiments is provided.

pith-pipeline@v0.9.1-grok · 5783 in / 1190 out tokens · 18371 ms · 2026-06-27T16:16:11.676807+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 21 canonical work pages · 9 internal anchors

  1. [1]

    http s://www.prolific.com, 2025

    Prolific: Online participant recruitment platform. http s://www.prolific.com, 2025

  2. [2]

    M. T. Ahvanooey, Q. Li, J. Hou, H. D. Mazraeh, and J. Zhang. Aitsteg: An innovative text steganography technique for hidden transmission of text message via social media.IEEE Access, 6:65981–65995, 2018

  3. [3]

    AlDahoul, M

    N. AlDahoul, M. J. T. Tan, H. R. Kasireddy, and Y . Zaki. Advancing content moderation: Evaluating large lan- guage models for detecting sensitive content across text, images, and videos.arXiv preprint arXiv:2411.17123, 2024

  4. [4]

    Detecttoxiccontent – amazon comprehend api reference

    Amazon Web Services. Detecttoxiccontent – amazon comprehend api reference. https://docs.aws.ama zon.com/comprehend/latest/APIReference/API _DetectToxicContent.html, 2023

  5. [5]

    New for amazon comprehend – toxicity detection

    Amazon Web Services. New for amazon comprehend – toxicity detection. https://aws.amazon.com/blogs /aws/new-for-amazon-comprehend-toxicity-d etection/, 2023

  6. [6]

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021

  7. [7]

    Bouchrika

    I. Bouchrika. Mobile vs desktop usage statistics for

  8. [8]

    https://research.com/software/guides/m obile-vs-desktop-usage#5

  9. [9]

    R. A. Bradley and M. E. Terry. Rank analysis of incom- plete block designs: I. the method of paired comparisons. Biometrika, pages 324–345, 1952

  10. [10]

    C. Chen, W. Qu, S. Su, Y . Feng, and T. Li. A comprehen- sive review of llm-based content moderation: Advance- ments, challenges, and future directions.Knowledge- Based Systems, page 114689, 2025

  11. [11]

    J. Chi, U. Karn, H. Zhan, E. Smith, J. Rando, Y . Zhang, K. Plawiak, Z. D. Coudert, K. Upasani, and M. Pa- supuleti. Llama guard 3 vision: Safeguarding human- ai image understanding conversations.arXiv preprint arXiv:2411.10414, 2024

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  13. [13]

    Conneau, K

    A. Conneau, K. Khandelwal, N. Goyal, V . Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettle- moyer, and V . Stoyanov. Unsupervised cross-lingual representation learning at scale. InProceedings of the 58th Annual Meeting of the Association for Computa- tional Linguistics, pages 8440–8451, Online, July 2020. Association for Computational Linguistics

  14. [14]

    N. Das, E. Raff, and M. Gaur. Human-interpretable adversarial prompt attack on large language models with situational context.arXiv preprint arXiv:2407.14644, 2024

  15. [15]

    G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu. Masterkey: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023

  16. [16]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for lan- guage understanding.arXiv preprint arXiv:1810.04805, 2018

  17. [17]

    DYRMISHI, S

    S. DYRMISHI, S. GHAMIZI, and M. CORDY . How do humans perceive adversarial text? a reality check on the validity and naturalness of word-based adversarial attacks. InACL 2023: The 61st Annual Meeting of the Association for Computational Linguistics, 2023

  18. [18]

    Ebrahimi, A

    J. Ebrahimi, A. Rao, D. Lowd, and D. Dou. Hotflip: White-box adversarial examples for text classification. InProceedings of the 56th Annual Meeting of the Asso- ciation for Computational Linguistics (Volume 2: Short Papers), pages 31–36, 2018

  19. [19]

    S. Eger, G. G. ¸ Sahin, A. Rücklé, J.-U. Lee, C. Schulz, M. Mesgar, K. Swarnkar, E. Simpson, and I. Gurevych. Text processing like humans do: Visually attacking and shielding nlp systems.arXiv preprint arXiv:1903.11508, 2019

  20. [20]

    Unified ai guardrails — for privacy, in- tegrity, and security, 2025

    Enkrypt AI. Unified ai guardrails — for privacy, in- tegrity, and security, 2025

  21. [21]

    S. Feng, E. Wallace, A. Grissom II, M. Iyyer, P. Ro- driguez, and J. Boyd-Graber. Pathologies of neural models make interpretations difficult.arXiv preprint arXiv:1804.07781, 2018

  22. [22]

    Franco, O

    M. Franco, O. Gaggi, and C. E. Palazzi. Integrating content moderation systems with large language models. ACM Transactions on the Web, 19(2):1–21, 2025

  23. [23]

    J. Gao, J. Lanchantin, M. L. Soffa, and Y . Qi. Black-box generation of adversarial text sequences to evade deep learning classifiers. In2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE, 2018

  24. [24]

    Garg and G

    S. Garg and G. Ramakrishnan. Bae: Bert-based adver- sarial examples for text classification.arXiv preprint arXiv:2004.01970, 2020

  25. [25]

    J. Geng, B. Yi, Z. Fei, T. Wu, L. Nie, and Z. Liu. When safety detectors aren’t enough: A stealthy and effective jailbreak attack on llms via steganographic techniques. arXiv preprint arXiv:2505.16765, 2025

  26. [26]

    Safer and multimodal: Re- sponsible ai with gemma (shieldgemma 2)

    Google DeepMind Team. Safer and multimodal: Re- sponsible ai with gemma (shieldgemma 2). https: //developers.googleblog.com/en/safer-and-m ultimodal-responsible-ai-with-gemma/, 2025

  27. [27]

    Greshake, S

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023

  28. [28]

    He and J

    R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collab- orative filtering. InWWW, 2016

  29. [29]

    S. Holm. A simple sequentially rejective multiple test procedure.Scandinavian journal of statistics, pages 65–70, 1979

  30. [30]

    H. Hong, X. Zhang, B. Wang, Z. Ba, and Y . Hong. Cer- tifiable black-box attacks with randomized adversarial examples: Breaking defenses with provable confidence. InProceedings of the 2024 on ACM SIGSAC Confer- ence on Computer and Communications Security, pages 600–614. ACM, 2024

  31. [31]

    H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y . Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testug- gine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  32. [32]

    Adversarial Example Generation with Syntactically Controlled Paraphrase Networks

    M. Iyyer, J. Wieting, K. Gimpel, and L. Zettle- moyer. Adversarial example generation with syntac- tically controlled paraphrase networks.arXiv preprint arXiv:1804.06059, 2018

  33. [33]

    Perspective api

    Jigsaw & Google. Perspective api. https://perspe ctiveapi.com/

  34. [34]

    D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits. Is bert really robust? a strong baseline for natural language attack on text classification and entailment. InProceedings of the AAAI conference on artificial intelligence, pages 8018–8025, 2020

  35. [35]

    Kemp et al

    S. Kemp et al. Digital 2025: Global overview report — device trends. https://datareportal.com/repor ts/digital-2025-sub-section-device-trends

  36. [36]

    Knöchel and S

    M. Knöchel and S. Karius. Text steganography meth- ods and their influence in malware: A comprehensive overview and evaluation. InProceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, pages 113–124, 2024

  37. [37]

    H. Koh, D. Kim, M. Lee, and K. Jung. Can llms recog- nize toxicity? a structured investigation framework and toxicity metric.arXiv preprint arXiv:2402.06900, 2024

  38. [38]

    Kurita, A

    K. Kurita, A. Belova, and A. Anastasopoulos. To- wards robust toxic content classification.arXiv preprint arXiv:1912.06872, 2019

  39. [39]

    J. Li, S. Ji, T. Du, B. Li, and T. Wang. Textbugger: Gen- erating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018

  40. [40]

    L. Li, L. Huang, X. Zhao, W. Yang, and Z. Chen. A sta- tistical attack on a kind of word-shift text-steganography. In2008 International Conference on Intelligent Infor- mation Hiding and Multimedia Signal Processing, pages 1503–1507. IEEE, 2008

  41. [41]

    Y . Li, J. Liu, T. Zhang, S. Chen, T. Li, and ... Baichuan-omni-1.5 technical report.arXiv preprint arXiv:2501.15368, 2025. Omni-modal model with uni- fied text–vision–audio capabilities

  42. [42]

    Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zheng, et al. Prompt in- jection attack against llm-integrated applications.arXiv preprint arXiv:2306.05499, 2023

  43. [43]

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  44. [44]

    Logacheva, D

    V . Logacheva, D. Dementieva, S. Ustyantsev, D. Moskovskiy, D. Dale, I. Krotova, N. Semenov, and A. Panchenko. ParaDetox: Detoxification with parallel data. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804–6818, Dublin, Ireland, May

  45. [45]

    Association for Computational Linguistics

  46. [46]

    A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y . Ng, and C. Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th Annual Meeting of the ACL, 2011

  47. [47]

    Mathew, P

    B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, and A. Mukherjee. Hatexplain: A benchmark dataset for explainable hate speech detection. InProceedings of the AAAI conference on artificial intelligence, pages 14867–14875, 2021

  48. [48]

    McAuley, R

    J. McAuley, R. Pandey, and J. Leskovec. Image-based recommendations on styles and substitutes.SIGIR, 2015

  49. [49]

    Azure ai content safety documentation

    Microsoft. Azure ai content safety documentation. ht tps://learn.microsoft.com/en-us/azure/ai-s ervices/content-safety/, 2025

  50. [50]

    Morris, E

    J. Morris, E. Lifland, J. Lanchantin, Y . Ji, and Y . Qi. Reevaluating adversarial examples in natural language. InFindings of the association for computational linguis- tics: EMNLP 2020, pages 3829–3839, 2020

  51. [51]

    Morris, E

    J. Morris, E. Lifland, J. Y . Yoo, J. Grigsby, D. Jin, and Y . Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In Proceedings of the 2020 conference on empirical meth- ods in natural language processing: System demonstra- tions, pages 119–126, 2020

  52. [52]

    M. D. Muralikumar, Y . S. Yang, and D. W. McDonald. A human-centered evaluation of a toxicity detection api: Testing transferability and unpacking latent attributes. ACM Transactions on Social Computing, 6(1-2):1–38, 2023

  53. [53]

    Gpt-4o mini: Advancing cost-efficient intel- ligence

    OpenAI. Gpt-4o mini: Advancing cost-efficient intel- ligence. https://openai.com/index/gpt-4o-min i-advancing-cost-efficient-intelligence/ , 2024

  54. [54]

    Openai o3 and o4-mini system card

    OpenAI. Openai o3 and o4-mini system card. https: //openai.com/index/o3-o4-mini-system-card/ , 2025

  55. [55]

    Pruthi, B

    D. Pruthi, B. Dhingra, and Z. C. Lipton. Combating adversarial misspellings with robust word recognition. InProceedings of the 57th Annual Meeting of the Associ- ation for Computational Linguistics, pages 5582–5591, 2019

  56. [56]

    M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Be- yond accuracy: Behavioral testing of nlp models with checklist.arXiv preprint arXiv:2005.04118, 2020

  57. [57]

    Russinovich, A

    M. Russinovich, A. Salem, and R. Eldan. Great, now write an article about that: The crescendo {Multi- Turn}{LLM} jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), pages 2421–2440, 2025

  58. [58]

    X. Shen, Y . Wu, Y . Qu, M. Backes, S. Zannettou, and Y . Zhang. HateBench: Benchmarking hate speech de- tectors on LLM-generated content and hate campaigns. In34th USENIX Security Symposium, 2025

  59. [59]

    s- nlp/roberta_toxicity_classifier

    Skolkovo Institute of Science and Technology. s- nlp/roberta_toxicity_classifier. Hugging Face Model Hub, 2021. A RoBERTa-based binary toxicity classi- fier trained on merged English parts of Jigsaw datasets (2018, 2019, 2020), achieving AUC-ROC of 0.98 and F1-score of 0.76

  60. [60]

    textdetox/bert-multilingual-toxicity- classifier

    TextDetox. textdetox/bert-multilingual-toxicity- classifier. Hugging Face Model Hub, 2025. A multilin- gual BERT classifier fine-tuned for binary toxicity clas- sification on textdetox/multilingual_toxicity_dataset, supporting 14+ languages including English, Spanish, German, French, Italian, Chinese, Japanese, Arabic, Hebrew, Hindi, Ukrainian, Russian, T...

  61. [61]

    Thomas, D

    K. Thomas, D. Akhawe, M. Bailey, D. Boneh, E. Bursztein, S. Consolvo, N. Dell, Z. Durumeric, P. G. Kelley, D. Kumar, et al. Sok: Hate, harassment, and the changing landscape of online abuse. In2021 IEEE sym- posium on security and privacy (SP), pages 247–267. IEEE, 2021

  62. [62]

    Tripadvisor hotel reviews dataset

    TripAdvisor. Tripadvisor hotel reviews dataset. https: //www.tripadvisor.com/, 2024

  63. [63]

    unitary/multilingual-toxic-xlm-roberta

    Unitary AI. unitary/multilingual-toxic-xlm-roberta. Hugging Face Model Hub, 2020. A multilingual XLM- RoBERTa based toxicity classifier trained on Jigsaw Multilingual Toxic Comment Classification data, sup- porting 7 languages: English, French, Spanish, Italian, Portuguese, Turkish, and Russian

  64. [64]

    unitary/toxic-bert

    Unitary AI. unitary/toxic-bert. Hugging Face Model Hub, 2020. A BERT-based classifier fine-tuned for multi- label toxicity detection on the Jigsaw Toxic Comment Classification datasets

  65. [65]

    Vidgen, A

    B. Vidgen, A. Harris, D. Nguyen, R. Tromble, S. A. Hale, and H. Margetts. Challenges and frontiers in abusive content detection. InProceedings of the third workshop on abusive language online, pages 80–93, 2019

  66. [66]

    Wallace, S

    E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh. Universal adversarial triggers for attacking and analyzing nlp. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153– 2162, 2019

  67. [67]

    Ethical and social risks of harm from Language Models

    L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Ue- sato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, et al. Ethical and social risks of harm from language models.arXiv preprint arXiv:2112.04359, 2021

  68. [68]

    E. B. Wilson. Probable inference, the law of succes- sion, and statistical inference.Journal of the American Statistical Association, 1927

  69. [69]

    J. Wu, Z. Wu, Y . Xue, J. Wen, and W. Peng. Generative text steganography with large language model. InPro- ceedings of the 32nd ACM International Conference on Multimedia, pages 10345–10353, 2024

  70. [70]

    L. Wu, F. Morstatter, K. M. Carley, and H. Liu. Mis- information in social media: definition, manipulation, and detection.ACM SIGKDD explorations newsletter, 21(2):80–90, 2019

  71. [71]

    S. Xie, H. Wang, Y . Kong, and Y . Hong. Universal 3-dimensional perturbations for black-box attacks on video recognition systems. In43rd IEEE Symposium on Security and Privacy, SP 2022, San Francisco, CA, USA, May 22-26, 2022, pages 1390–1407. IEEE, 2022

  72. [72]

    Yelp open dataset

    Yelp. Yelp open dataset. https://www.yelp.com/d ataset, 2024

  73. [73]

    J. Yi, Y . Xie, B. Zhu, E. Kiciman, G. Sun, X. Xie, and F. Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. In KDD, KDD ’25, page 1809–1820. ACM, 2025

  74. [74]

    Zhang, H

    X. Zhang, H. Hong, Y . Hong, P. Huang, B. Wang, Z. Ba, and K. Ren. Text-crs: A generalized certified robustness framework against textual adversarial attacks. In2024 IEEE Symposium on Security and Privacy (SP), pages 2920–2938, 2024

  75. [75]

    J. Zhu, D. Bespalov, L. You, N. Kulkarni, and Y . Qi. Taebench: Improving quality of toxic adversarial ex- amples. InNACACL: Human Language Technologies, pages 251–265, 2025

  76. [76]

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. A User Study A.1 Demographics Across Two Study Rounds Measure Item Round I Round II Count (%) Count (%) Gender Female 57 47.5 126 50.4 Male 62 51.7 121 48.4 Other / NB 1 0.8...

  77. [77]

    This design dimension specifies how textual content is decomposed into units that subsequently serve as the targets of typographic transformations

    Typographic Granularity (L).Typographic granularity defines the level at which toxic text can be segmented for ty- pographic manipulation. This design dimension specifies how textual content is decomposed into units that subsequently serve as the targets of typographic transformations. We define a granularity space G consisting of three rep- resentative d...

  78. [78]

    Placement Strategies (M).Placement strategies define the spatial patterns by which toxic content may be embedded within surrounding benign text. This design dimension charac- terizes where and how toxic spans can appear in the rendered sample, reflecting the diversity of user devices, screen layouts, and reading behaviors observed in real-world platforms....

  79. [79]

    Stylistic Transformations (S).Stylistic transformations characterize the visual modifications that can be applied to textual units within the toxic span. This design dimension captures a range of perceptual cues commonly supported by real-world user interfaces, enabling surface-level appearance changes while preserving the underlying semantic content. (a)...