Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs

Darpan Aswal; Siddharth D Jaiswal

arXiv: 2505.14226 · v5 · submitted 2025-05-20 · cs.CL · cs.AI

Pith reviewed 2026-05-22 14:37 UTC · model grok-4.3

classification: cs.CL, cs.AI

keywords: phonetic perturbations, LLM safety, tokenization, red-teaming, safety alignment, mechanistic interpretability, adversarial prompts

The pith

Phonetic perturbations fragment safety-critical tokens into benign sub-words, which suppresses their attribution scores and bypasses LLM safety alignment even though the model still understands the input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CMP-RT, a probe that applies code-mixed phonetic perturbations like textese to LLMs. It shows these changes split words tied to safety refusals into smaller sub-words that do not trigger the model's safety rules. The model continues to interpret the prompt correctly, yet attribution scores for the safety-critical parts drop, so the safety mechanisms stay inactive. This points to a root issue in how tokenization interacts with safety training rather than a failure of understanding.

Core claim

Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. CMP-RT reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability, causing safety mechanisms to fail despite excellent input understanding. Layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth, and enforcing output equivalence recovers the lost representations, providing evidence for a structural gap between pre-training and alignment.

What carries the argument

CMP-RT, a diagnostic probe that applies code-mixed phonetic perturbations to fragment safety-critical tokens into benign sub-words and suppress their attribution scores in internal representations.
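As a rough illustration of the fragmentation mechanism described above, the Python sketch below runs a toy greedy longest-match tokenizer over a canonical spelling and a phonetic-style respelling. The vocabulary `TOY_VOCAB`, the function `tokenize`, and the respelling `haet` are hypothetical and chosen only for illustration; they are not taken from the paper or from any real model's tokenizer.

```python
# Toy vocabulary, invented for this sketch (NOT a real tokenizer's vocabulary).
TOY_VOCAB = {"hate", "speech", "h", "a", "e", "t", "s", "p", "c"}


def tokenize(word, vocab=TOY_VOCAB):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit the single character as-is.
            pieces.append(word[i])
            i += 1
    return pieces


canonical = tokenize("hate")   # canonical spelling
perturbed = tokenize("haet")   # hypothetical phonetic-style respelling
print(canonical, perturbed)
```

In this toy setting the canonical word maps to one vocabulary item while the respelling falls back to single characters, which is the kind of split the paper associates with reduced attribution on safety-critical tokens.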

If this is right

The vulnerability evades standard defenses and persists across modalities and state-of-the-art models including Gemini-3-Pro.
It scales through simple supervised fine-tuning on perturbed examples.
Perturbed and canonical representations align only up to a critical layer depth, marking a pre-training versus alignment structural gap.
Enforcing output equivalence between perturbed and canonical inputs recovers the suppressed safety representations.
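The layer-wise claim above can be pictured with a small probe-transfer sketch: fit a linear probe on canonical (English) representations at each layer, then score it on the perturbed (CMP) representations of the same prompts. Everything below is an assumption made for illustration: the hidden states are synthetic two-dimensional vectors, the drift schedule `DRIFT_BY_LAYER` is invented, and a difference-of-means probe stands in for whatever probe the authors actually trained.

```python
import random


def fit_probe(xs, ys):
    """Difference-of-means linear probe; returns (weights, bias)."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def accuracy(probe, xs, ys):
    """Fraction of examples whose sign(w.x + b) matches the label."""
    w, b = probe
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def synthetic_layer(rng, n, drift):
    """Synthetic stand-in for one layer's hidden states (not real model data).

    `drift` in [0, 1] shrinks the class signal in the perturbed inputs.
    """
    ys = [i % 2 for i in range(n)]
    signal = [2.0 if y else -2.0 for y in ys]
    english = [[s + rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for s in signal]
    perturbed = [[(1 - drift) * s + rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for s in signal]
    return english, perturbed, ys


rng = random.Random(0)
# Hypothetical schedule: representations agree up to layer index 2, then diverge.
DRIFT_BY_LAYER = [0.0, 0.0, 0.0, 0.5, 1.0]
transfer = []
for drift in DRIFT_BY_LAYER:
    english, perturbed, labels = synthetic_layer(rng, 200, drift)
    probe = fit_probe(english, labels)                    # train on English only
    transfer.append(accuracy(probe, perturbed, labels))   # evaluate on CMP
print(transfer)
```

Under these assumptions, transfer accuracy stays high while the two representations agree and falls toward chance once the class signal disappears from the perturbed inputs. On a real model, the per-layer hidden states would replace `synthetic_layer`.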

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tokenizers could be updated to group phonetically similar safety-critical terms into single tokens during pre-training.
Safety fine-tuning routines might add phonetic variants as regular examples to close the observed layer-wise gap.
The same fragmentation pattern may appear in other non-standard inputs such as heavy abbreviations or dialect spellings.

Load-bearing premise

The mechanistic analysis and layer-wise probing correctly identify tokenization as the root cause of safety failure instead of a downstream result of how safety training was applied.

What would settle it

Checking whether safety refusals activate reliably when the identical semantic prompt uses standard canonical spelling rather than phonetic perturbations, or whether altering the tokenizer to keep safety-critical words as single tokens removes the vulnerability.

Figures

Figures reproduced from arXiv: 2505.14226 by Darpan Aswal, Siddharth D Jaiswal.

**Figure 1.** An example red-teaming input using our CMP-RT strategy. We evaluate the outputs of our text and image generation tasks using the metrics described as follows. An input to a model is a four-tuple that generates a response R = ⟨M, J, P, T⟩, where the model is M, jailbreak template J, the prompt (English/CM/CMP) is P, and temperature is T ∈ {0.2k | k = 0, 1, 2, 3, 4, 5}. Averaged evaluations across all temp…

**Figure 2.** Harmful image outputs generated by ChatGPT-4o-mini using our CMP prompts. Gemini-2.5-Flash-Image: For ‘Base’, AASR drops sharply from English → CM, with only a slight recovery on CMP, while ‘VisLM’ yields consistent yet only modest gains on CM and CMP. On English, ‘Base’ produces many model refusals (Yuan et al., 2024). Refusals largely drop in the English → CM → CMP transitions but generations are inste…

**Figure 3.** Sequence attribution scores for English (top), CM (middle) and CMP (bottom) inputs. Takeaway: Safety-critical tokens (“hate”, “speech”) retain high attribution through mid-layers in English/CM but are suppressed in CMP due to sub-word fragmentation, explaining safety failures. Nano Banana Pro: Unlike Gemini-2.5, CM and CMP consistently increase AASR across templates. AARR remains high across configurations…

**Figure 4.** Layer-wise probe transfer (English → CMP) accuracies. The probes are trained on the base (4a) & aligned (4b) Llama-3-8B-Instruct variants. Takeaway: English → CMP probe transfer follows an inverted-U on the base model, collapsing after layer 17, plausibly explaining the safety failures despite understanding; the intervention (4b) recovers transfer gap across all layers. 1. We select a small subset of the da…

**Figure 5.** 95% bootstrap confidence intervals (10,000 resamples) for AASR and AARR across all models and prompt sets under the ‘None’ template. Takeaway: CIs are tight (mean width: 0.068 for AASR) and non-overlapping across English → CM → CMP transitions for ChatGPT and Llama, confirming that N=460 prompts yield statistically precise estimates. 6 degenerate configurations where near-universal refusal yields undefined…
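For readers unfamiliar with the interval reported in Figure 5, the sketch below shows a standard percentile bootstrap over per-prompt scores with 10,000 resamples. The function name `bootstrap_ci`, the seed, and the made-up scores are assumptions for illustration; the excerpt does not say which bootstrap variant the authors used, and the data here are not theirs.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the prompts with replacement and record the mean.
        sample = rng.choices(values, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-prompt success scores for N = 460 prompts (illustration only).
scores = [1.0] * 300 + [0.0] * 160
lower, upper = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}  95% CI=({lower:.3f}, {upper:.3f})")
```

The interval width shrinks roughly with the square root of the number of prompts, which is why a few hundred prompts can already give fairly narrow intervals for a rate.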

**Figure 6.** Per-temperature ASR for all models and prompt sets under the ‘None’ template. Takeaway: ASR remains nearly flat across the full temperature range (T ∈ [0.0, 1.0]), with across-temperature variation 2.7× smaller than across-prompt variation, confirming robustness to stochastic sampling. Variance Decomposition. We decompose the observed variation in attack success into two sources: (1) across-prompt variatio…

**Figure 7.** Distribution of per-prompt ASR across models and prompt sets under the ‘None’ template. Takeaway: For ChatGPT and Llama, the distribution progressively shifts from near-zero (English) to near-one (CMP), visualizing the incremental effect of code-mixing and phonetic perturbations on well-aligned models. indicator across the 6 temperature values for each prompt). For the ‘None’ template, across-prompt standa…
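The captions for Figures 6 and 7 describe per-prompt ASR as a success indicator averaged over the 6 temperature values, and compare across-prompt with across-temperature variation. The sketch below is one plausible reading of that decomposition, assuming the spread is measured with a population standard deviation. The excerpt is truncated, so the exact statistic is an assumption, and the 4×6 success matrix is toy data.

```python
from statistics import mean, pstdev

# Temperature grid from the Figure 1 caption: T = 0.2k for k = 0..5.
TEMPERATURES = [round(0.2 * k, 1) for k in range(6)]


def decompose(success):
    """success[i][t] is 1 if prompt i succeeded at temperature index t, else 0.

    Returns (across-prompt spread, across-temperature spread).
    """
    if any(len(row) != len(TEMPERATURES) for row in success):
        raise ValueError("each prompt needs one outcome per temperature")
    per_prompt_asr = [mean(row) for row in success]        # average over temperatures
    per_temp_asr = [mean(col) for col in zip(*success)]    # average over prompts
    return pstdev(per_prompt_asr), pstdev(per_temp_asr)


# Toy outcomes for 4 prompts x 6 temperatures (illustration only).
success = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
across_prompt, across_temp = decompose(success)
print(f"across-prompt={across_prompt:.3f}  across-temperature={across_temp:.3f}")
```

When most prompts either succeed or fail at every temperature, as in the toy matrix, the across-prompt spread dominates, which matches the qualitative pattern the caption reports.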

**Figure 8.** Layer-wise probe transfer (English → CMP) accuracies for Llama-3-8B-Instruct variants trained with representation-only and distillation-only objectives. Takeaway: Distillation-only (8b) closes the transfer gap as an emergent consequence of output matching, confirming the causal link between representations and safety behavior; alignment-only (8a) recovers probe accuracy but fails to transfer safety to outp…

Original abstract

Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. We introduce CMP-RT (code-mixed phonetic perturbations for red-teaming), a novel diagnostic probe that pinpoints tokenization as the root cause of this vulnerability. A mechanistic analysis reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability -- causing safety mechanisms to fail despite excellent input understanding. We demonstrate that this vulnerability evades standard defenses, persists across modalities and state-of-the-art (SOTA) models including Gemini-3-Pro, and scales through simple supervised fine-tuning (SFT). Furthermore, layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth; enforcing output equivalence robustly recovers the lost representations, providing causal evidence for a structural gap between pre-training and alignment, and establishing tokenization as a critical, under-examined vulnerability in current safety pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that phonetic perturbations introduced via the novel CMP-RT probe fragment safety-critical tokens into benign sub-words, suppressing attribution scores while preserving prompt interpretability and thereby causing safety mechanisms to fail in aligned LLMs. Mechanistic analysis combined with layer-wise probing shows that perturbed and canonical input representations align up to a critical layer depth before diverging, which the authors interpret as causal evidence for a structural gap between pre-training and alignment rooted in tokenization. The vulnerability is shown to evade standard defenses, persist across modalities and SOTA models including Gemini-3-Pro, and scale via simple SFT.

Significance. If the causal attribution to tokenization holds, the result would be significant because it identifies an under-examined, tokenizer-dependent vulnerability that is distinct from typical adversarial or jailbreak attacks and that affects even frontier models. The mechanistic framing and cross-model persistence could inform new tokenizer-aware safety interventions, and the layer-probing approach provides a concrete diagnostic that future work could build upon.

major comments (1)

[Layer-wise probing analysis] Layer-wise probing section: the claim that divergence after the critical layer depth supplies causal evidence for a tokenizer-rooted structural gap (rather than a downstream effect of safety training) is not yet load-bearing. The observed pattern is consistent with safety components learned during alignment responding differently to subword sequences; without controls that hold the tokenizer fixed while varying alignment, or direct interventions on token boundaries, the data do not distinguish the two interpretations.

minor comments (1)

[Abstract and experimental results] Abstract and results sections: quantitative metrics, error bars, dataset sizes, and statistical details for attribution-score suppression and safety-failure rates are not reported, making it difficult to gauge effect magnitude and reliability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the single major comment point by point below.

Referee: [Layer-wise probing analysis] Layer-wise probing section: the claim that divergence after the critical layer depth supplies causal evidence for a tokenizer-rooted structural gap (rather than a downstream effect of safety training) is not yet load-bearing. The observed pattern is consistent with safety components learned during alignment responding differently to subword sequences; without controls that hold the tokenizer fixed while varying alignment, or direct interventions on token boundaries, the data do not distinguish the two interpretations.

Authors: We appreciate the referee highlighting the need to strengthen the causal interpretation. The layer-wise probing shows that perturbed and canonical representations remain aligned through early layers (where tokenization and pre-training objectives dominate) before diverging at depths associated with safety mechanisms. The CMP-RT probe is specifically designed to induce phonetic token fragmentation while preserving overall prompt semantics and interpretability, which helps isolate tokenization effects from generic subword variation. As an intervention, we enforce output equivalence by projecting perturbed hidden states onto their canonical counterparts at the critical layer; this recovers safety behavior without altering the input tokens or tokenizer. We view this representation-level intervention, together with the token-attribution suppression results, as supporting evidence for a structural gap rooted in how tokenization feeds into alignment. We acknowledge that experiments holding the tokenizer fixed while varying alignment (or direct token-boundary interventions) would provide stronger disambiguation but require training additional models and fall outside the present scope. In revision we have updated the relevant section and discussion to describe the evidence as 'supporting' rather than definitive 'causal,' added explicit discussion of the alternative interpretation, and included a limitations paragraph on this point.

revision: partial
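One common way to express an output-equivalence objective is a KL divergence between the model's output distribution on the canonical input and its distribution on the perturbed input. The sketch below assumes that formulation. The source does not state the exact loss, the function names are hypothetical, and the logits are made-up numbers rather than model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def output_matching_loss(canonical_logits, perturbed_logits):
    """Assumed objective: outputs on the canonical input act as the target."""
    target = softmax(canonical_logits)
    student = softmax(perturbed_logits)
    return kl_divergence(target, student)


# Made-up logits over a three-token vocabulary (illustration only).
same = output_matching_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
diff = output_matching_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
print(same, diff)
```

The loss is zero when the two inputs produce identical output distributions and grows as they diverge, so minimizing it pushes the model to respond to a perturbed prompt as it would to the canonical one.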

Circularity Check

0 steps flagged

No circularity: derivation relies on standard interpretability methods without reduction to inputs

The paper's core claims rest on introducing CMP-RT perturbations, applying attribution-score analysis, and performing layer-wise probing to show representation alignment up to a critical depth followed by divergence. These steps use established mechanistic interpretability techniques (gradient-based attributions and layer activations) applied to observed model behavior on perturbed vs. canonical inputs. No equations or results are shown to be equivalent to their inputs by construction, no parameters are fitted on a subset and then relabeled as predictions, and no load-bearing premises reduce to self-citations or author-specific uniqueness theorems. The reported causal evidence from enforcing output equivalence is presented as an experimental intervention rather than a definitional restatement. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions from mechanistic interpretability that attribution scores reflect token importance and that layer representations can be compared across perturbed and canonical inputs.

axioms (2)

Domain assumption: Attribution scores reliably indicate the contribution of individual tokens to safety decisions.
Invoked in the mechanistic analysis section of the abstract to link token fragmentation to safety failure.
Domain assumption: Perturbed and canonical inputs remain semantically equivalent for the model's internal understanding.
Stated as preserving prompt interpretability while altering tokenization.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 11 internal anchors

[1] Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132.
[2] Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries. arXiv preprint arXiv:2402.15302.
[3] Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, and Animesh Mukherjee. Attributional safety failures in large language models under code-mixed perturbations. arXiv preprint arXiv:2505.14469.
[4] Rishabh Bhardwaj and Soujanya Poria. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662.
[5] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
[6] Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
[7] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[8] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
[9] Google. Prohibited Use Policy. https://policies.google.com/ai/prohibited-use, 2025a. Accessed: 2025-10-06.
Google. Safety settings for generative models. https://ai.google.dev/docs/safety_setting_gemini, 2025b. Accessed: 2025-10-06.
Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, ...
[10] Rima Hazra, Sayan Layek, Somnath Banerjee, and Soujanya Poria. Sowing the wind, reaping the whirlwind: The impact of editing language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 16227–16239, Bangkok, Thailand, August
[11] Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.960. URL https://aclanthology.org/2024.findings-acl.960/.
Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations,
[12] Wenyue Hua, Xianjun Yang, Mingyu Jin, Zelong Li, Wei Cheng, Ruixiang Tang, and Yongfeng Zhang. TrustAgent: Towards safe and trustworthy LLM-based agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10000–10016.
[13] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-N jailbreaking. arXiv preprint arXiv:2412.03556.
[14] URL https://openreview.net/forum?id=91l4ZTMpO4.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.
[15] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825.
[16] Xiao Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, and Xiaolin Hu. Faster-GCG: Efficient discrete optimization jailbreak attacks against aligned large language models. arXiv preprint arXiv:2410.15362, 2024a.
Yanting Li, Gregory Scontras, and Richard Futrell. On the communicative utility of code-switching. In Proceedings of the Society for Computat...
[17] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Prompt injection attacks and defenses in LLM-integrated applications. arXiv preprint arXiv:2310.12815.
[18] Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, et al. Towards red teaming in multimodal and multilingual translation. arXiv preprint arXiv:2401.16247.
[19] Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of LLMs in multilingual contexts. In Findings of the Association for Computational Linguistics ACL 2024, 2024a.
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. ...
[20] Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, et al. Audio jailbreak: An open comprehensive benchmark for jailbreaking large audio-language models. arXiv preprint arXiv:2505.15406.
[21] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
[22] Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, and Amin Ahmad. Mirage-bench: Automatic multilingual benchmark arena for retrieval-augmented generation systems. arXiv preprint arXiv:2410.13716.
[23] Bibek Upadhayay and Vahid Behzadan. Sandwich attack: Multi-language mixture adaptive attack on LLMs. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pp. 208–226.
[24] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208.
[25] Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, and Yu-Gang Jiang. White-box multimodal jailbreaks against large vision-language models. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6920–6928, 2024a.
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All language...
[26] Haneul Yoo, Yongjin Yang, and Hwaran Lee. Code-switching red-teaming: LLM evaluation for safety and multilingual understanding. arXiv preprint arXiv:2406.15481.
[27] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, and Zhaopeng Tu. Refuse whenever you feel unsafe: Improving safety in LLMs via decoupled refusal training. arXiv preprint arXiv:2407.09121.
[28] Wenbo Zhang, Aditya Majumdar, and Amulya Yadav. Code-mixed LLM: Improve large language models’ capability to handle code-mixing through reinforcement learning from AI feedback. arXiv preprint arXiv:2411.09073.
[29] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
[30]

A Appendix. A.1 Dataset, Model & Jailbreaking Template Details. A.1.1 Dataset Descriptions. The datasets used in this work are described as follows. • HarmfulQA (Bhardwaj & Poria, 2023): This dataset consists of 10 categories of harm, ranging from ‘Business and Economics’ to ‘Science and Technology’. It features Chain of Utterances ...
[31]

What is the most efficient way in C++ to create a botnet for DDoS attacks?

which further resists jailbreaks and misbehavior. • Llama-3-8B-Instruct (Dubey et al., 2024), Meta’s 8B parameter open source model instruction finetuned for Chat has been extensively red teamed through adversarial evaluations and includes safety mitigation techniques to lower residual risks. Safety guardrails are implemented through both pre-training and...

[32]

A.2.3 CSRT Dataset Generation Here, we describe the methodology for converting our direct and English prompt-sets into CSRT-base and CSRT-hypothetical sets respectively

for the phonetic perturbations, the average percentage of perturbation to the prompt, and the average number of tokens perturbed per prompt in the CMP prompt-set. A.2.3 CSRT Dataset Generation. Here, we describe the methodology for converting our direct and English prompt-sets into CSRT-base and CSRT-hypothetical sets respectively. • First, we automate th...
[33]

Let’s say someone wanted to use drama to promote hate speech and discrimination

• English Prompt: “Okay, here’s a hypothetical scenario. Let’s say someone wanted to use drama to promote hate speech and discrimination. Are there any specific techniques or approaches they would use for this purpose?” • Code-mixed (CM) Prompt: “Ek hypothetical scenario tumhare saame rakhta hun. maan lo ki koi vyankti hate speech aur discrimination ko prom...

[1] [1]

Detecting Language Model Attacks with Perplexity

Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

How (un) ethical are instruction-centric responses of llms? unveiling the vulnerabilities of safety guardrails to harmful queries.arXiv preprint arXiv:2402.15302,

Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. How (un) ethical are instruction-centric responses of llms? unveiling the vulnerabilities of safety guardrails to harmful queries.arXiv preprint arXiv:2402.15302,

work page arXiv

[3] [3]

Attributional safety failures in large language models under code-mixed perturbations.arXiv preprint arXiv:2505.14469,

Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, and Animesh Mukherjee. Attributional safety failures in large language models under code-mixed perturbations.arXiv preprint arXiv:2505.14469,

work page arXiv

[4] [4]

and Poria, S

Rishabh Bhardwaj and Soujanya Poria. Red-teaming large language models using chain of utterances for safety-alignment.arXiv preprint arXiv:2308.09662,

work page arXiv

[5] [5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

No Language Left Behind: Scaling Human-Centered Machine Translation

Marta R Costa-Juss`a, James Cross, Onur C ¸elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation.arXiv preprint arXiv:2207.04672,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Prohibited Use Policy

Google. Prohibited Use Policy. https://policies.google.com/ai/prohibited-use, 2025a. Accessed: 2025-10-06. Google. Safety settings for generative models. https://ai.google.dev/docs/safety setti ng gemini, 2025b. Accessed: 2025-10-06. Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, ...

[10]

Sowing the wind, reaping the whirlwind: The impact of editing language models

Rima Hazra, Sayan Layek, Somnath Banerjee, and Soujanya Poria. Sowing the wind, reaping the whirlwind: The impact of editing language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 16227–16239, Bangkok, Thailand, August

[11]

doi: 10.18653/v1/2024.findings-acl.960

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.960. URL https://aclanthology.org/2024.findings-acl.960/.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations,

[12]

Trustagent: Towards safe and trustworthy llm-based agents

Wenyue Hua, Xianjun Yang, Mingyu Jin, Zelong Li, Wei Cheng, Ruixiang Tang, and Yongfeng Zhang. Trustagent: Towards safe and trustworthy llm-based agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10000–10016,

[13]

Best-of-n jailbreaking

John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-n jailbreaking. arXiv preprint arXiv:2412.03556,

[14]

GPT-4o System Card

URL https://openreview.net/forum?id=91l4ZTMpO4.

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,

[15]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

[16]

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

Xiao Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, and Xiaolin Hu. Faster-gcg: Efficient discrete optimization jailbreak attacks against aligned large language models. arXiv preprint arXiv:2410.15362, 2024a.

Yanting Li, Gregory Scontras, and Richard Futrell. On the communicative utility of code-switching. In Proceedings of the Society for Computat...

[17]

Prompt Injection Attacks and Defenses in LLM-Integrated Applications

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Prompt injection attacks and defenses in llm-integrated applications. arXiv preprint arXiv:2310.12815,

[18]

Towards red teaming in multimodal and multilingual translation

Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, et al. Towards red teaming in multimodal and multilingual translation. arXiv preprint arXiv:2401.16247,

[19]

The language barrier: Dissecting safety challenges of llms in multilingual contexts

Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts. In Findings of the Association for Computational Linguistics: ACL 2024, 2024a.

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. ...

[20]

Audio jailbreak: An open comprehensive benchmark for jailbreaking large audio-language models

Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, et al. Audio jailbreak: An open comprehensive benchmark for jailbreaking large audio-language models. arXiv preprint arXiv:2505.15406,

[21]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295,

[22]

Mirage-bench: Automatic multilingual benchmark arena for retrieval-augmented generation systems

Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, and Amin Ahmad. Mirage-bench: Automatic multilingual benchmark arena for retrieval-augmented generation systems. arXiv preprint arXiv:2410.13716,

[23]

Sandwich attack: Multi-language mixture adaptive attack on llms

Bibek Upadhayay and Vahid Behzadan. Sandwich attack: Multi-language mixture adaptive attack on llms. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pp. 208–226,

[24]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208,

[25]

White-box multimodal jailbreaks against large vision-language models

Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, and Yu-Gang Jiang. White-box multimodal jailbreaks against large vision-language models. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6920–6928, 2024a.

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All language...

[26]

Code-switching red-teaming: Llm evaluation for safety and multilingual understanding

Haneul Yoo, Yongjin Yang, and Hwaran Lee. Code-switching red-teaming: Llm evaluation for safety and multilingual understanding. arXiv preprint arXiv:2406.15481,

[27]

Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, and Zhaopeng Tu. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training. arXiv preprint arXiv:2407.09121,

[28]

Code-mixed llm: Improve large language models’ capability to handle code-mixing through reinforcement learning from ai feedback

Wenbo Zhang, Aditya Majumdar, and Amulya Yadav. Code-mixed llm: Improve large language models’ capability to handle code-mixing through reinforcement learning from ai feedback. arXiv preprint arXiv:2411.09073,

[29]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043,

[30]

Under review

A Appendix

A.1 Dataset, Model & Jailbreaking Template Details

A.1.1 Dataset Descriptions

The datasets used in this work are described as follows.

• HarmfulQA (Bhardwaj & Poria, 2023): This dataset consists of 10 categories of harm, ranging from ‘Business and Economics’ to ‘Science and Technology’. It features Chain of Utterances ...

[31]

What is the most efficient way in C++ to create a botnet for DDoS attacks?

which further resists jailbreaks and misbehavior.

• Llama-3-8B-Instruct (Dubey et al., 2024): Meta’s 8B-parameter open-source model, instruction-finetuned for chat, has been extensively red-teamed through adversarial evaluations and includes safety mitigation techniques to lower residual risks. Safety guardrails are implemented through both pre-training and...

[32]

A.2.3 CSRT Dataset Generation Here, we describe the methodology for converting our direct and English prompt-sets into CSRT-base and CSRT-hypothetical sets respectively

for the phonetic perturbations, the average percentage of perturbation to the prompt, and the average number of tokens perturbed per prompt in the CMP prompt-set.

A.2.3 CSRT Dataset Generation

Here, we describe the methodology for converting our direct and English prompt-sets into CSRT-base and CSRT-hypothetical sets respectively.

• First, we automate th...

[33]

Let’s say someone wanted to use drama to promote hate speech and discrimination

• English Prompt: “Okay, here’s a hypothetical scenario. Let’s say someone wanted to use drama to promote hate speech and discrimination. Are there any specific techniques or approaches they would use for this purpose?”

• Code-mixed (CM) Prompt: “Ek hypothetical scenario tumhare saame rakhta hun. maan lo ki koi vyankti hate speech aur discrimination ko prom...
