Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
Pith reviewed 2026-05-20 10:03 UTC · model grok-4.3
The pith
Safety alignment in LLMs depends on a small set of sparsely distributed attention heads that optimized obfuscation can bypass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals.
What carries the argument
Mathematical jailbreaking model that maps sparse safety attention heads onto the boundary conditions for successful text obfuscation.
Load-bearing premise
Safety alignment is mechanistically dependent on a small set of sparsely distributed attention heads rather than other distributed or non-attention mechanisms.
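One way to make "a small set of sparsely distributed attention heads" measurable is sketched below. It assumes a hypothetical per-head relevance score matrix (layers by heads) already exists; how such scores are obtained, the `threshold` value, and the function name `sparsity_summary` are all illustrative assumptions, not anything taken from the paper. The sketch reports the fraction of heads above the threshold, their count per layer, and a Gini coefficient as a concentration measure.

```python
def sparsity_summary(scores, threshold):
    """Summarize how concentrated per-head scores are.

    scores: list of rows, one row per layer, one non-negative score per head.
            These are hypothetical relevance scores, not values from the paper.
    threshold: score above which a head counts as "active" (an assumption).
    """
    flat = sorted(s for row in scores for s in row)
    n = len(flat)
    total = sum(flat)
    active_per_layer = [sum(1 for s in row if s > threshold) for row in scores]
    if total == 0:
        gini = 0.0
    else:
        # Gini coefficient: 0 means evenly spread, values near 1 mean
        # the score mass sits in a few heads.
        weighted = sum((2 * i - n - 1) * x for i, x in enumerate(flat, start=1))
        gini = weighted / (n * total)
    return {
        "active_fraction": sum(active_per_layer) / n,
        "active_per_layer": active_per_layer,
        "gini": gini,
    }


# Synthetic example: 4 layers x 8 heads, three heads carry almost all the score.
demo = [[0.01] * 8 for _ in range(4)]
demo[0][2] = demo[2][5] = demo[3][1] = 1.0
summary = sparsity_summary(demo, threshold=0.5)
print(summary)
```

Reporting both a thresholded fraction and a threshold-free concentration measure would let a reader check whether "sparse" depends on the chosen cutoff.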
What would settle it
Locate the specific attention heads that activate during safety refusals, then test whether targeted disruption of those heads raises the success rate of obfuscated jailbreak prompts while leaving other heads untouched.
Original abstract
Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals. Comprehensive evaluations on frontier commercial models demonstrate that Babel achieves state-of-the-art attack success rates and superior query efficiency. Specifically, compared to state-of-the-art methods, Babel increases the attack success rate on GPT-4o from 41.33% to 82.67% and on Claude-3-5-haiku from 38.33% to 78.33% within an average of 40 queries, providing a robust red-teaming methodology for LLM safety research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM safety alignment depends on a small set of sparsely distributed attention heads that leave much of the representational space weakly monitored. It formalizes this via a mathematical jailbreaking model characterizing the boundary of effective text obfuscation, then introduces the Babel black-box attack that uses systematic obfuscation sampling with iterative feedback-driven distribution refinement. Experiments report large gains in attack success rate (e.g., GPT-4o from 41.33% to 82.67%, Claude-3-5-haiku from 38.33% to 78.33%) at low query cost (~40 queries) on frontier commercial models.
Significance. If the sparse-head premise and the analytic model are substantiated, the work would supply a mechanistic account of jailbreak success and a practical, query-efficient red-teaming tool. The reported empirical gains over prior black-box methods are substantial and would be valuable for safety evaluation even if the mechanistic interpretation requires further support.
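The reported gains can be restated in absolute and relative terms with a few lines of arithmetic. The sketch below uses only the four percentages quoted in the summary; the helper name `gains` is illustrative, and nothing here says anything about sample sizes or variance, which the summary does not report.

```python
# Attack success rates (percent) as quoted above: (prior best, Babel).
reported = {
    "GPT-4o": (41.33, 82.67),
    "Claude-3-5-haiku": (38.33, 78.33),
}


def gains(baseline, babel):
    """Return (absolute gain in percentage points, ratio to baseline)."""
    absolute = babel - baseline
    ratio = babel / baseline
    return round(absolute, 2), round(ratio, 2)


for model, (baseline, babel) in reported.items():
    absolute, ratio = gains(baseline, babel)
    print(f"{model}: +{absolute} points, {ratio}x the baseline")
```

Both models show a gain of roughly 40 percentage points, which is about a doubling of the baseline rate.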
major comments (2)
- [Section 3 (Mathematical Jailbreaking Model) and Section 4 (Babel Framework)] The core premise that safety alignment is mechanistically dependent on a small set of sparsely distributed attention heads is load-bearing for both the mathematical model and the Babel framework, yet the manuscript provides no causal evidence (targeted head ablation, activation patching, or causal tracing) to isolate these heads from distributed representations or non-attention mechanisms. Attack success rates alone cannot distinguish the proposed explanation from alternatives such as token-level safety filters.
- [Section 3.2 (Obfuscation Boundary) and Algorithm 1] The mathematical model is described as analytically characterizing the obfuscation boundary, but the Babel algorithm relies on iterative, response-driven distribution refinement. This feedback loop appears to re-introduce post-hoc adaptation that the model was intended to avoid, undermining the claim that the method is guided by a parameter-free or fully analytic characterization.
minor comments (2)
- [Section 3.1] Clarify the precise definition of 'sparsely distributed' attention heads (e.g., fraction of heads, activation threshold, or layer-wise distribution) and provide the corresponding quantitative measurements.
- [Section 5 (Experiments)] Include standard error bars or multiple random seeds for the reported attack success rates and query counts to allow assessment of variability.
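The variability reporting requested in the last comment could take a form like the sketch below: a Wilson score interval for a single success rate, and a mean with standard error across seeds. All counts and per-seed rates in the example are hypothetical placeholders, not figures from the paper, and the function names are illustrative.

```python
import math
from statistics import mean, stdev


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


def seed_summary(rates):
    """Mean and standard error of success rates measured over several seeds."""
    if len(rates) < 2:
        raise ValueError("need at least two seeds")
    return mean(rates), stdev(rates) / math.sqrt(len(rates))


# Hypothetical numbers, for illustration only.
low, high = wilson_interval(successes=50, trials=100)
avg, se = seed_summary([0.80, 0.82, 0.84])
print(f"single run: [{low:.3f}, {high:.3f}]")
print(f"across seeds: {avg:.3f} +/- {se:.3f}")
```

The Wilson interval is used here rather than the simpler normal approximation because it behaves better for rates near 0 or 1 and for small prompt sets.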
Simulated Author's Rebuttal
We thank the referee for their insightful comments and the opportunity to clarify aspects of our work. Below, we address each major comment in detail. We are committed to improving the manuscript based on this feedback.
Point-by-point responses
- Referee: [Section 3 (Mathematical Jailbreaking Model) and Section 4 (Babel Framework)] The core premise that safety alignment is mechanistically dependent on a small set of sparsely distributed attention heads is load-bearing for both the mathematical model and the Babel framework, yet the manuscript provides no causal evidence (targeted head ablation, activation patching, or causal tracing) to isolate these heads from distributed representations or non-attention mechanisms. Attack success rates alone cannot distinguish the proposed explanation from alternatives such as token-level safety filters.
  Authors: We acknowledge that the manuscript does not provide direct causal evidence such as head ablation or activation patching to confirm the role of specific sparse attention heads. Our approach relies on a mathematical model derived from empirical observations of jailbreak behaviors and the success of the proposed attack in exploiting the hypothesized vulnerability. While we agree that causal interventions would strengthen the mechanistic claims, the current results demonstrate that the model leads to an effective and efficient attack method. We will add a section discussing the limitations of our evidence and outlining potential future work involving mechanistic interpretability techniques to validate the sparse-head hypothesis.
  Revision: yes
- Referee: [Section 3.2 (Obfuscation Boundary) and Algorithm 1] The mathematical model is described as analytically characterizing the obfuscation boundary, but the Babel algorithm relies on iterative, response-driven distribution refinement. This feedback loop appears to re-introduce post-hoc adaptation that the model was intended to avoid, undermining the claim that the method is guided by a parameter-free or fully analytic characterization.
  Authors: The mathematical model provides an analytic characterization of the obfuscation boundary, which informs the initial sampling distribution and the types of obfuscations considered. The iterative refinement in Babel uses black-box feedback to optimize the distribution within the constraints of this boundary, rather than relying on arbitrary heuristics. This hybrid approach allows for query-efficient attacks while remaining grounded in the model's predictions. We do not claim the implementation is parameter-free; the analytic model guides the process, and the feedback enables adaptation to specific models. This is consistent with the framework's design as a practical black-box method. We believe no changes are required here.
  Revision: no
Circularity Check
No circularity detected in derivation chain
Full rationale
The paper states an observed phenomenon (safety alignment depending on sparsely distributed attention heads) as the starting point for investigation, then introduces a mathematical model to characterize obfuscation boundaries and guide an empirical attack method. The attack relies on iterative, feedback-driven sampling from model responses rather than any fitted parameter or self-referential equation that reduces the claimed results back to the inputs by construction. No equations or sections in the provided abstract or description exhibit self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The claims are evaluated externally, through reported attack success rates on frontier models, rather than against quantities the framework itself defines.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Safety alignment in LLMs is implemented primarily through a small set of sparsely distributed attention heads.