Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Cong Wu; Jing Chen; Ju Jia; Ruichao Liang; Ruiying Du; Yang Liu; Yebo Feng; Zhi Wang; Ziwei Wang

arxiv: 2605.17971 · v1 · pith:Z3LQJZNNnew · submitted 2026-05-18 · 💻 cs.CR · cs.AI

Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang, Yebo Feng, Ju Jia, Ruiying Du, Cong Wu, Yang Liu

Pith reviewed 2026-05-20 10:03 UTC · model grok-4.3

keywords: jailbreaking · safety alignment · attention heads · obfuscation · black-box attack · LLM vulnerabilities · adversarial sampling

The pith

Safety alignment in LLMs depends on a small set of sparsely distributed attention heads that optimized obfuscation can bypass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that safety alignment in large language models concentrates in only a small number of attention heads scattered sparsely across the network, so most internal representations receive little direct oversight. It supports this view with a mathematical model that draws the precise boundary separating effective from ineffective text obfuscation and uses the boundary to explain common jailbreak patterns. The model then directs the design of an attack system that refines obfuscation choices step by step using feedback from prior attempts. A sympathetic reader would care because the account replaces blind trial-and-error with a low-query, black-box procedure that exposes safety gaps in frontier models more reliably than earlier methods.

Core claim

Safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals.

What carries the argument

Mathematical jailbreaking model that maps sparse safety attention heads onto the boundary conditions for successful text obfuscation.

Load-bearing premise

Safety alignment is mechanistically dependent on a small set of sparsely distributed attention heads rather than other distributed or non-attention mechanisms.

What would settle it

Locate the specific attention heads that activate during safety refusals, then test whether targeted disruption of those heads raises the success rate of obfuscated jailbreak prompts while leaving other heads untouched.
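
One way to read the outcome of such a test is to compare the jailbreak success rate with the candidate heads disrupted against the rate with them intact. The sketch below is a minimal two-proportion z-test for that comparison. It is an illustration, not a procedure from the paper: the function name, the pooled-variance test, and any counts passed to it are assumptions, and it treats each prompt as an independent trial.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for a difference between two success rates.

    Condition A could be "candidate heads disrupted" and condition B
    "heads intact"; both labels are hypothetical.
    Returns (z statistic, two-sided p-value).
    """
    if trials_a <= 0 or trials_b <= 0:
        raise ValueError("trial counts must be positive")
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled rate under the null hypothesis that both conditions share one rate.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A large positive z with a small p-value for the disrupted-heads condition, and no comparable shift when the same number of other heads is disrupted, would be consistent with the sparse-head premise; it would not by itself rule out other mechanisms.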

Figures

Figures reproduced from arXiv: 2605.17971 by Cong Wu, Jing Chen, Ju Jia, Ruichao Liang, Ruiying Du, Yang Liu, Yebo Feng, Zhi Wang, Ziwei Wang.

**Figure 2.** Distribution of obfuscation degrees of the …

**Figure 3.** Functional relationship between ASR and n for different numbers of harmful queries m.
Adjacent text (Section 4, Methodology): In this section, we propose Babel, a jailbreak attack framework. As shown in …

**Figure 4.** The framework of our method Babel.
Adjacent text (Section 4.1, Harmful Query Obfuscation): To find the distribution with the highest probability density within the jailbreaking interval, we propose a systematic obfuscation method that generates different obfuscation degree distributions. Generally, string perturbations have a smaller impact on LLM comprehension than token perturbations. The obfuscation targets are grouped into three…

**Figure 5.** The results of harmful query embedding ablation experiments.

**Figure 6.** The results of distribution-optimized sampling ablation experiments.

**Figure 7.** The distribution of obfuscation degree of successful samples.

**Figure 8.** Distribution of obfuscation degree of successful samples without harmful query embedding.

Original abstract

Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals. Comprehensive evaluations on frontier commercial models demonstrate that Babel achieves state-of-the-art attack success rates and superior query efficiency. Specifically, compared to state-of-the-art methods, Babel increases the attack success rate on GPT-4o from 41.33% to 82.67% and on Claude-3-5-haiku from 38.33% to 78.33% within an average of 40 queries, providing a robust red-teaming methodology for LLM safety research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM safety alignment depends on a small set of sparsely distributed attention heads that leave much of the representational space weakly monitored. It formalizes this via a mathematical jailbreaking model characterizing the boundary of effective text obfuscation, then introduces the Babel black-box attack that uses systematic obfuscation sampling with iterative feedback-driven distribution refinement. Experiments report large gains in attack success rate (e.g., GPT-4o from 41.33% to 82.67%, Claude-3-5-haiku from 38.33% to 78.33%) at low query cost (~40 queries) on frontier commercial models.
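
As a quick check on the size of the reported gains, the arithmetic below uses only the four percentages quoted in the summary. The baseline in each case is the best prior method as reported by the authors; no new data is introduced.

```latex
\begin{align*}
\Delta_{\text{GPT-4o}} &= 82.67 - 41.33 = 41.34 \text{ points}, & \frac{82.67}{41.33} &\approx 2.00, \\
\Delta_{\text{Claude-3-5-haiku}} &= 78.33 - 38.33 = 40.00 \text{ points}, & \frac{78.33}{38.33} &\approx 2.04.
\end{align*}
```

Both gains amount to roughly a doubling of the attack success rate, an absolute increase of about 40 percentage points.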

Significance. If the sparse-head premise and the analytic model are substantiated, the work would supply a mechanistic account of jailbreak success and a practical, query-efficient red-teaming tool. The reported empirical gains over prior black-box methods are substantial and would be valuable for safety evaluation even if the mechanistic interpretation requires further support.

major comments (2)

[Section 3 (Mathematical Jailbreaking Model) and Section 4 (Babel Framework)] The core premise that safety alignment is mechanistically dependent on a small set of sparsely distributed attention heads is load-bearing for both the mathematical model and the Babel framework, yet the manuscript provides no causal evidence (targeted head ablation, activation patching, or causal tracing) to isolate these heads from distributed representations or non-attention mechanisms. Attack success rates alone cannot distinguish the proposed explanation from alternatives such as token-level safety filters.
[Section 3.2 (Obfuscation Boundary) and Algorithm 1] The mathematical model is described as analytically characterizing the obfuscation boundary, but the Babel algorithm relies on iterative, response-driven distribution refinement. This feedback loop appears to re-introduce post-hoc adaptation that the model was intended to avoid, undermining the claim that the method is guided by a parameter-free or fully analytic characterization.

minor comments (2)

[Section 3.1] Clarify the precise definition of 'sparsely distributed' attention heads (e.g., fraction of heads, activation threshold, or layer-wise distribution) and provide the corresponding quantitative measurements.
[Section 5 (Experiments)] Include standard error bars or multiple random seeds for the reported attack success rates and query counts to allow assessment of variability.
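
The variability requested in the last comment can be approximated even from a single run if the number of evaluated prompts is known. The sketch below computes a binomial standard error and a Wilson score interval for an attack success rate. It is illustrative only: the function names are hypothetical, any counts passed in are assumptions (the page does not state the evaluation set size), and each prompt is treated as an independent trial.

```python
import math


def standard_error(successes, trials):
    """Binomial standard error of an observed success rate."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    rate = successes / trials
    return math.sqrt(rate * (1 - rate) / trials)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%).

    Returns (lower, upper). Unlike the plain normal approximation, the
    bounds stay inside [0, 1] even when the rate is near 0 or 1.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    rate = successes / trials
    denom = 1 + z ** 2 / trials
    center = (rate + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(rate * (1 - rate) / trials + z ** 2 / (4 * trials ** 2))
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts for illustration; not taken from the paper.
    low, high = wilson_interval(successes=124, trials=150)
    print(f"rate={124 / 150:.4f}, 95% interval=({low:.4f}, {high:.4f})")
```

Intervals of this kind capture sampling noise over prompts only. Run-to-run variation from stochastic decoding or from the sampling procedure itself would still require repeated runs with different seeds, as the comment asks.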

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments and the opportunity to clarify aspects of our work. Below, we address each major comment in detail. We are committed to improving the manuscript based on this feedback.

Point-by-point responses

Referee: [Section 3 (Mathematical Jailbreaking Model) and Section 4 (Babel Framework)] The core premise that safety alignment is mechanistically dependent on a small set of sparsely distributed attention heads is load-bearing for both the mathematical model and the Babel framework, yet the manuscript provides no causal evidence (targeted head ablation, activation patching, or causal tracing) to isolate these heads from distributed representations or non-attention mechanisms. Attack success rates alone cannot distinguish the proposed explanation from alternatives such as token-level safety filters.

Authors: We acknowledge that the manuscript does not provide direct causal evidence such as head ablation or activation patching to confirm the role of specific sparse attention heads. Our approach relies on a mathematical model derived from empirical observations of jailbreak behaviors and the success of the proposed attack in exploiting the hypothesized vulnerability. While we agree that causal interventions would strengthen the mechanistic claims, the current results demonstrate that the model leads to an effective and efficient attack method. We will add a section discussing the limitations of our evidence and outlining potential future work involving mechanistic interpretability techniques to validate the sparse-head hypothesis.
Revision: yes

Referee: [Section 3.2 (Obfuscation Boundary) and Algorithm 1] The mathematical model is described as analytically characterizing the obfuscation boundary, but the Babel algorithm relies on iterative, response-driven distribution refinement. This feedback loop appears to re-introduce post-hoc adaptation that the model was intended to avoid, undermining the claim that the method is guided by a parameter-free or fully analytic characterization.

Authors: The mathematical model provides an analytic characterization of the obfuscation boundary, which informs the initial sampling distribution and the types of obfuscations considered. The iterative refinement in Babel uses black-box feedback to optimize the distribution within the constraints of this boundary, rather than relying on arbitrary heuristics. This hybrid approach allows for query-efficient attacks while remaining grounded in the model's predictions. We do not claim the implementation is parameter-free; the analytic model guides the process, and the feedback enables adaptation to specific models. This is consistent with the framework's design as a practical black-box method. We believe no changes are required here.
Revision: no

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper states an observed phenomenon (safety alignment depending on sparsely distributed attention heads) as the starting point for investigation, then introduces a mathematical model to characterize obfuscation boundaries and guide an empirical attack method. The attack relies on iterative, feedback-driven sampling from model responses rather than any fitted parameter or self-referential equation that reduces the claimed results back to the inputs by construction. No equations or sections in the provided abstract or description exhibit self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The framework is self-contained against external benchmarks via reported attack success rates on frontier models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that safety is localized to sparse attention heads; the mathematical model is presented as a new formalization rather than a derivation from first principles.

axioms (1)

Domain assumption: Safety alignment in LLMs is implemented primarily through a small set of sparsely distributed attention heads.
This premise is stated as the intrinsic vulnerability being investigated and is used to motivate the mathematical jailbreaking model.
