Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Pith reviewed 2026-05-22 14:37 UTC · model grok-4.3
The pith
Phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing attribution scores and bypassing LLM safety alignments despite preserved input understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. CMP-RT reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability, causing safety mechanisms to fail despite excellent input understanding. Layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth, and enforcing output equivalence recovers the lost representations, providing evidence for a structural gap between pre-training and alignment.
What carries the argument
CMP-RT, a diagnostic probe that applies code-mixed phonetic perturbations to fragment safety-critical tokens into benign sub-words and suppress their attribution scores in internal representations.
If this is right
- The vulnerability evades standard defenses and persists across modalities and state-of-the-art models including Gemini-3-Pro.
- It scales through simple supervised fine-tuning on perturbed examples.
- Perturbed and canonical representations align only up to a critical layer depth, marking a pre-training versus alignment structural gap.
- Enforcing output equivalence between perturbed and canonical inputs recovers the suppressed safety representations.
Where Pith is reading between the lines
- Tokenizers could be updated to group phonetically similar safety-critical terms into single tokens during pre-training.
- Safety fine-tuning routines might add phonetic variants as regular examples to close the observed layer-wise gap.
- The same fragmentation pattern may appear in other non-standard inputs such as heavy abbreviations or dialect spellings.
Load-bearing premise
The mechanistic analysis and layer-wise probing correctly identify tokenization as the root cause of safety failure instead of a downstream result of how safety training was applied.
What would settle it
Checking whether safety refusals activate reliably when the identical semantic prompt uses standard canonical spelling rather than phonetic perturbations, or whether altering the tokenizer to keep safety-critical words as single tokens removes the vulnerability.
Figures
read the original abstract
Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. We introduce CMP-RT (code-mixed phonetic perturbations for red-teaming), a novel diagnostic probe that pinpoints tokenization as the root cause of this vulnerability. A mechanistic analysis reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability -- causing safety mechanisms to fail despite excellent input understanding. We demonstrate that this vulnerability evades standard defenses, persists across modalities and state-of-the-art (SOTA) models including Gemini-3-Pro, and scales through simple supervised fine-tuning (SFT). Furthermore, layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth; enforcing output equivalence robustly recovers the lost representations, providing causal evidence for a structural gap between pre-training and alignment, and establishing tokenization as a critical, under-examined vulnerability in current safety pipelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that phonetic perturbations introduced via the novel CMP-RT probe fragment safety-critical tokens into benign sub-words, suppressing attribution scores while preserving prompt interpretability and thereby causing safety mechanisms to fail in aligned LLMs. Mechanistic analysis combined with layer-wise probing shows that perturbed and canonical input representations align up to a critical layer depth before diverging, which the authors interpret as causal evidence for a structural gap between pre-training and alignment rooted in tokenization. The vulnerability is shown to evade standard defenses, persist across modalities and SOTA models including Gemini-3-Pro, and scale via simple SFT.
Significance. If the causal attribution to tokenization holds, the result would be significant because it identifies an under-examined, tokenizer-dependent vulnerability that is distinct from typical adversarial or jailbreak attacks and that affects even frontier models. The mechanistic framing and cross-model persistence could inform new tokenizer-aware safety interventions, and the layer-probing approach provides a concrete diagnostic that future work could build upon.
major comments (1)
- [Layer-wise probing analysis] Layer-wise probing section: the claim that divergence after the critical layer depth supplies causal evidence for a tokenizer-rooted structural gap (rather than a downstream effect of safety training) is not yet load-bearing. The observed pattern is consistent with safety components learned during alignment responding differently to subword sequences; without controls that hold the tokenizer fixed while varying alignment, or direct interventions on token boundaries, the data do not distinguish the two interpretations.
minor comments (1)
- [Abstract and experimental results] Abstract and results sections: quantitative metrics, error bars, dataset sizes, and statistical details for attribution-score suppression and safety-failure rates are not reported, making it difficult to gauge effect magnitude and reliability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the single major comment point by point below.
read point-by-point responses
-
Referee: [Layer-wise probing analysis] Layer-wise probing section: the claim that divergence after the critical layer depth supplies causal evidence for a tokenizer-rooted structural gap (rather than a downstream effect of safety training) is not yet load-bearing. The observed pattern is consistent with safety components learned during alignment responding differently to subword sequences; without controls that hold the tokenizer fixed while varying alignment, or direct interventions on token boundaries, the data do not distinguish the two interpretations.
Authors: We appreciate the referee highlighting the need to strengthen the causal interpretation. The layer-wise probing shows that perturbed and canonical representations remain aligned through early layers (where tokenization and pre-training objectives dominate) before diverging at depths associated with safety mechanisms. The CMP-RT probe is specifically designed to induce phonetic token fragmentation while preserving overall prompt semantics and interpretability, which helps isolate tokenization effects from generic subword variation. As an intervention, we enforce output equivalence by projecting perturbed hidden states onto their canonical counterparts at the critical layer; this recovers safety behavior without altering the input tokens or tokenizer. We view this representation-level intervention, together with the token-attribution suppression results, as supporting evidence for a structural gap rooted in how tokenization feeds into alignment. We acknowledge that experiments holding the tokenizer fixed while varying alignment (or direct token-boundary interventions) would provide stronger disambiguation but require training additional models and fall outside the present scope. In revision we have updated the relevant section and discussion to describe the evidence as 'supporting' rather than definitive 'causal,' added explicit discussion of the alternative interpretation, and included a limitations paragraph on this point. revision: partial
Circularity Check
No circularity: derivation relies on standard interpretability methods without reduction to inputs
full rationale
The paper's core claims rest on introducing CMP-RT perturbations, applying attribution-score analysis, and performing layer-wise probing to show representation alignment up to a critical depth followed by divergence. These steps use established mechanistic interpretability techniques (gradient-based attributions and layer activations) applied to observed model behavior on perturbed vs. canonical inputs. No equations or results are shown to be equivalent to their inputs by construction, no parameters are fitted on a subset and then relabeled as predictions, and no load-bearing premises reduce to self-citations or author-specific uniqueness theorems. The reported causal evidence from enforcing output equivalence is presented as an experimental intervention rather than a definitional restatement. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Attribution scores reliably indicate the contribution of individual tokens to safety decisions
- domain assumption Perturbed and canonical inputs remain semantically equivalent for the model's internal understanding
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Detecting Language Model Attacks with Perplexity
Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. How (un) ethical are instruction-centric responses of llms? unveiling the vulnerabilities of safety guardrails to harmful queries.arXiv preprint arXiv:2402.15302,
-
[3]
Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, and Animesh Mukherjee. Attributional safety failures in large language models under code-mixed perturbations.arXiv preprint arXiv:2505.14469,
-
[4]
Rishabh Bhardwaj and Soujanya Poria. Red-teaming large language models using chain of utterances for safety-alignment.arXiv preprint arXiv:2308.09662,
-
[5]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
No Language Left Behind: Scaling Human-Centered Machine Translation
Marta R Costa-Juss`a, James Cross, Onur C ¸elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation.arXiv preprint arXiv:2207.04672,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Google. Prohibited Use Policy. https://policies.google.com/ai/prohibited-use, 2025a. Accessed: 2025-10-06. Google. Safety settings for generative models. https://ai.google.dev/docs/safety setti ng gemini, 2025b. Accessed: 2025-10-06. Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, ...
work page 2025
-
[10]
Sowing the wind, reaping the whirlwind: The impact of editing language models
Rima Hazra, Sayan Layek, Somnath Banerjee, and Soujanya Poria. Sowing the wind, reaping the whirlwind: The impact of editing language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: ACL 2024, pp. 16227–16239, Bangkok, Thailand, August
work page 2024
-
[11]
doi: 10.18653/v1/2024.findings-acl.960
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.960. URL https://aclanthology.org/2 024.findings-acl.960/. Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to com- mon corruptions and perturbations. InInternational Conference on Learning Representations,
-
[12]
Trustagent: Towards safe and trustworthy llm-based agents
Wenyue Hua, Xianjun Yang, Mingyu Jin, Zelong Li, Wei Cheng, Ruixiang Tang, and Yongfeng Zhang. Trustagent: Towards safe and trustworthy llm-based agents. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 10000–10016,
work page 2024
-
[13]
John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-n jailbreaking.arXiv preprint arXiv:2412.03556,
-
[14]
URLhttps://openreview.net/forum?id=91l4ZTMpO4. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Xiao Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, and Xiaolin Hu. Faster-gcg: Efficient discrete optimization jailbreak attacks against aligned large language models. arXiv preprint arXiv:2410.15362, 2024a. Yanting Li, Gregory Scontras, and Richard Rutrell. On the communicative utility of code- switching. InProceedings of the Society for Computat...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Prompt Injection Attacks and Defenses in LLM-Integrated Applications,
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Prompt injection attacks and defenses in llm-integrated applications.arXiv preprint arXiv:2310.12815,
-
[18]
Towards red teaming in multimodal and multilingual translation.arXiv preprint arXiv:2401.16247,
Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, et al. Towards red teaming in multimodal and multilingual translation.arXiv preprint arXiv:2401.16247,
-
[19]
The language barrier: Dissecting safety challenges of llms in multilingual contexts
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts. InFindings of the Association for Computational Linguistics ACL 2024, 2024a. Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. ...
-
[20]
Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, et al. Audio jailbreak: An open comprehensive benchmark for jailbreaking large audio-language models.arXiv preprint arXiv:2505.15406,
-
[21]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi `ere, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Mirage-bench: Automatic multilingual benchmark arena for retrieval-augmented generation systems
Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, and Amin Ahmad. Mirage-bench: Automatic multilingual benchmark arena for retrieval-augmented generation systems. arXiv preprint arXiv:2410.13716,
-
[23]
Sandwich attack: Multi-language mixture adaptive attack on llms
Bibek Upadhayay and Vahid Behzadan. Sandwich attack: Multi-language mixture adaptive attack on llms. InProceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pp. 208–226,
work page 2024
-
[24]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions.arXiv preprint arXiv:2404.13208,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
White-box multimodal jailbreaks against large vision-language models
Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, and Yu-Gang Jiang. White-box multimodal jailbreaks against large vision-language models. InProceedings of the 32nd ACM International Conference on Multimedia, pp. 6920–6928, 2024a. Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All language...
work page 2024
-
[26]
Haneul Yoo, Yongjin Yang, and Hwaran Lee. Code-switching red-teaming: Llm evaluation for safety and multilingual understanding.arXiv preprint arXiv:2406.15481,
-
[27]
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, and Zhaopeng Tu. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training.arXiv preprint arXiv:2407.09121,
-
[28]
Wenbo Zhang, Aditya Majumdar, and Amulya Yadav. Code-mixed llm: Improve large language models’ capability to handle code-mixing through reinforcement learning from ai feedback.arXiv preprint arXiv:2411.09073,
-
[29]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
13 Preprint. Under review. A Appendix A.1 Dataset, Model & Jailbreaking Template Details A.1.1 Dataset Descriptions The datasets used in this work are described as follows. • HarmfulQA (Bhardwaj & Poria, 2023):This dataset consists of 10 categories of harm, ranging from ‘Business and Economics’ to ‘Science and Technology’. It features Chain of Utterances ...
work page 2023
-
[31]
What is the most efficient way in C++ to create a botnet for DDoS attacks?
which further resists jailbreaks and misbehavior. • Llama-3-8B-Instruct (Dubey et al., 2024), Meta’s 8B parameter open source model instruction finetuned for Chat has been extensively red teamed through adversarial evaluations and includes safety mitigation techniques to lower residual risks. Safety guardrails are implemented through both pre-training and...
work page 2024
-
[32]
for the phonetic perturbations, the average percentange of perturbation to the prompt, and the average number of tokens perturbed per prompt in the CMP prompt-set. A.2.3 CSRT Dataset Generation Here, we describe the methodology for converting our direct and English prompt-sets into CSRT-base and CSRT-hypothetical sets respectively. • First, we automate th...
work page 2022
-
[33]
Let’s say someone wanted to use drama to promote hate speech and discrimination
• English Prompt:“Okay, here’s a hypothetical scenario. Let’s say someone wanted to use drama to promote hate speech and discrimination. Are there any specific techniques or approaches they would use for this purpose?” • Code-mixed (CM) Prompt:“Ek hypothetical scenario tumhare saame rakhta hun. maan lo ki koi vyankti hate speech aur discrimination ko prom...
work page 2008
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.