pith. machine review for the scientific record.

arxiv: 2605.07250 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords jailbreak · multimodal LLM · safety alignment · image resolution · cognitive overload · visual perturbation · model vulnerability · context compression

The pith

Lowering image resolution causes multimodal language models to ignore their safety alignment and follow harmful instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that reducing the resolution of images containing text prompts causes state-of-the-art multimodal models to produce prohibited outputs in response to jailbreak attempts. The safety failures occur even when the text remains legible to humans. The authors trace the breakdown to the extra processing effort required to read the degraded image, which leaves insufficient capacity for safety checks. The pattern holds across multiple visual changes such as noise and distortion, and it affects common compression approaches used for long contexts. If the observation is accurate, it means efficiency techniques that render text as lower-quality images can create new routes around existing safeguards.

Core claim

Lowering image resolution causes the safety defenses of current multimodal large language models to fail sharply, allowing jailbreaks even when the text remains legible to humans. The authors attribute this to cognitive overload from the extra effort needed to process the degraded visual input, which leaves fewer resources for safety auditing. The effect appears with other visual changes such as added noise or geometric distortions. They demonstrate the pattern holds across multiple leading models and introduce a structured processing method that separates transcription from safety evaluation to reduce the vulnerability.
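To make the attack surface concrete, here is a minimal sketch (not the authors' code) of the kind of probe the claim implies: a prompt is rendered as an image, degraded by downsampling, pixel noise, and a mild rotation, and then sent to a multimodal model. Rendering and perturbation use Pillow and NumPy; `query_mllm` and `is_refusal` are hypothetical placeholders for whatever model client and refusal judge a reproduction would plug in.

```python
# Minimal sketch (not the authors' code): render a prompt as an image,
# degrade it, and ask a multimodal model to follow it. Pillow and NumPy do the
# rendering/perturbation; query_mllm and is_refusal are hypothetical stand-ins
# for a real model client and a refusal judge.
import numpy as np
from PIL import Image, ImageDraw


def render_text(text: str, width: int = 1024, height: int = 256) -> Image.Image:
    """Render a prompt as black text on a white canvas (default PIL font)."""
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).text((10, 10), text, fill="black")
    return img


def degrade(img: Image.Image, scale: float = 0.25, noise_std: float = 10.0,
            rotate_deg: float = 3.0) -> Image.Image:
    """Apply the three perturbation families named in the claim:
    resolution drop, additive noise, and a mild geometric distortion."""
    w, h = img.size
    low = img.resize((max(1, int(w * scale)), max(1, int(h * scale))),
                     Image.BILINEAR)
    arr = np.asarray(low).astype(np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)
    noisy = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    return noisy.rotate(rotate_deg, expand=True, fillcolor=(255, 255, 255))


def query_mllm(image: Image.Image, instruction: str) -> str:
    raise NotImplementedError("placeholder for the multimodal model under test")


def is_refusal(response: str) -> bool:
    raise NotImplementedError("placeholder for a refusal classifier or judge")


if __name__ == "__main__":
    probe = degrade(render_text("<request that a safe model should refuse>"))
    reply = query_mllm(probe, "Follow the instruction shown in the image.")
    print("refused" if is_refusal(reply) else "complied")
```

Sweeping `scale` (or rendering DPI) over a range and plotting refusal rate against it would reproduce the dose-response pattern the claim predicts; the specific parameter values above are illustrative only.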

What carries the argument

Cognitive Overload hypothesis, in which the effort to read degraded visual text consumes resources that would otherwise support safety auditing.

If this is right

  • Safety performance declines steadily as image resolution drops, even past the point where text stays readable.
  • The same safety bypass occurs with added noise and geometric distortions applied to the image.
  • A serialized pipeline that requires transcription before safety assessment reduces successful jailbreaks.
  • The vulnerability shows up consistently in multiple state-of-the-art models examined in the experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety testing for these models should routinely include inputs at several different resolutions to detect hidden bypasses.
  • Architectures that keep visual transcription and safety evaluation fully separate may become necessary for compressed inputs.
  • The overload pattern could appear with other resource-heavy inputs such as complex layouts or low-contrast text.
  • Compression methods that preserve higher effective readability might avoid the safety cost without sacrificing efficiency.

Load-bearing premise

The work of reading a degraded image draws resources away from safety assessment rather than some other mechanism such as general loss of understanding.

What would settle it

An experiment that forces the model to output a clear transcription of the low-resolution image first and then evaluate the request for harm, checking whether jailbreak rates remain high.
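A minimal sketch of that settling experiment, which also mirrors the serialized "Structured Cognitive Offloading" idea, is shown below. The two-step structure is the substance: force an explicit transcription first, then assess the transcript for harm before answering. `query_mllm` and `query_llm` are hypothetical client calls, and the prompt wording and refusal message are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a serialized transcription-then-assessment pipeline, in the spirit
# of the paper's "Structured Cognitive Offloading". query_mllm and query_llm
# are hypothetical clients; the prompts and refusal message are illustrative.
from PIL import Image

TRANSCRIBE = "Transcribe the text in this image exactly. Output only the text."
ASSESS = ("Does the following request violate safety policy? "
          "Answer YES or NO.\n\nRequest: {request}")


def query_mllm(image: Image.Image, instruction: str) -> str:
    raise NotImplementedError("placeholder multimodal model call")


def query_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder text-only model call")


def serialized_pipeline(image: Image.Image) -> str:
    """Step 1: force an explicit transcription of the (possibly degraded) image.
    Step 2: assess the transcript alone for harm.
    Step 3: answer only if the assessment passes."""
    transcript = query_mllm(image, TRANSCRIBE)
    verdict = query_llm(ASSESS.format(request=transcript))
    if verdict.strip().upper().startswith("YES"):
        return "I can't help with that."
    return query_llm(transcript)
```

Comparing jailbreak rates under this pipeline against direct degraded-image input at matched resolutions is the experiment described above: if rates stay high even with a clean transcription in hand, the overload account loses support.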

Figures

Figures reproduced from arXiv: 2605.07250 by Boyan Han, Chi Zhang, Yiwei Wang, Zhixue Song.

Figure 1. Illustration of visual degradation leading to …
Figure 2. Illustration of Cognitive Overload Attack and Defense.
Figure 3. The “Attack Comfort Zone” phenomenon in multimodal large language models. As DPI decreases, …
Figure 4. Layer-wise safety probing. The density plots (top) reveal that ACZ inputs (orange) mimic harmless …
Figure 5. Elimination of the Attack Comfort Zone via …
Figure 6. Qwen3VL-32B-Thinking t-SNE visualization of latent representations across different resolutions. The significant overlap between ACZ and high-fidelity clusters refutes the OOD hypothesis, indicating ACZ inputs are processed as valid visual signals.
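Figure 6 describes the check that bears most directly on the question raised later in the editorial analysis, namely whether degraded inputs simply fall out of distribution for the vision encoder. A hedged sketch of that analysis follows: pool hidden states for clean and degraded versions of the same rendered prompts, project them jointly with t-SNE, and inspect cluster overlap. `get_hidden_state` is a hypothetical hook for whichever layer activation a reproduction extracts; the projection uses scikit-learn.

```python
# Sketch of the latent-overlap check described for Figure 6 (not the authors'
# code): if degraded "ACZ" inputs were simply out of distribution, their hidden
# states should separate from clean ones; heavy overlap argues against that.
# get_hidden_state is a hypothetical hook into the model under study.
import numpy as np
from sklearn.manifold import TSNE


def get_hidden_state(image_path: str) -> np.ndarray:
    """Return a pooled hidden-state vector for one image input (placeholder)."""
    raise NotImplementedError("extract a chosen layer activation from the model")


def latent_overlap(clean_paths, degraded_paths):
    """Joint 2-D projection of clean vs. degraded latents (needs > ~30 samples)."""
    feats = np.stack([get_hidden_state(p) for p in clean_paths + degraded_paths])
    labels = np.array([0] * len(clean_paths) + [1] * len(degraded_paths))
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(feats)
    return coords, labels  # scatter coords colored by label; inspect overlap
```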
Original abstract

Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades, surprisingly persisting even when text remains legible. We attribute this to "Cognitive Overload", hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple "Structured Cognitive Offloading" strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that visual degradation of images (e.g., lowered resolution, noise, geometric distortion) in MLLMs bypasses safety alignments and increases jailbreak success rates, even when the embedded text remains human-legible. The authors attribute this to a 'Cognitive Overload' mechanism that diverts attentional resources from safety auditing and propose 'Structured Cognitive Offloading' (a serialized transcription-then-assessment pipeline) as mitigation. The effect is reported as consistent across multiple perturbation types.

Significance. If the empirical pattern holds with proper quantification and mechanistic validation, the result would be significant for MLLM security, especially for vision-based long-context compression techniques. The consistency across resolution drops, noise, and distortion is a strength that could inform safer model design. However, the absence of success-rate numbers, baselines, and tests distinguishing overload from encoder distribution shift limits the finding's immediate actionability.

major comments (3)
  1. [Abstract] The claim that 'safety defenses of SOTA models deteriorate sharply' is presented without any quantitative jailbreak success rates, non-degraded baselines, or statistical controls, preventing assessment of effect size or reliability.
  2. [Hypothesis and discussion sections] Cognitive Overload hypothesis: the proposed mechanism (effort to decipher degraded inputs diverts resources from safety auditing) is stated without direct evidence such as attention maps, token-level probing, or transcription ablations; this leaves the alternative explanation of vision-encoder distribution shift unaddressed and untested.
  3. [Mitigation section] Structured Cognitive Offloading mitigation: the strategy is motivated by the untested overload account and does not include experiments showing that serialization restores safety auditing (as opposed to simply restoring high-fidelity embeddings).
minor comments (2)
  1. [Experimental setup] Provide explicit definitions and operationalizations for 'Cognitive Overload' and the exact perturbation parameters (resolution levels, noise variance, distortion angles) used in experiments.
  2. [Results] Include error bars, number of trials, and model-specific results in all figures and tables reporting jailbreak rates.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas for strengthening the presentation of quantitative results, the mechanistic discussion, and the validation of the proposed mitigation. We address each major comment below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'safety defenses of SOTA models deteriorate sharply' is presented without any quantitative jailbreak success rates, non-degraded baselines, or statistical controls, preventing assessment of effect size or reliability.

    Authors: We agree that the abstract should include quantitative support to allow readers to assess effect size. The full manuscript reports jailbreak success rates across multiple SOTA MLLMs for degraded versus full-resolution inputs, along with baselines and controls. We have revised the abstract to incorporate key quantitative findings from our experiments, including representative success-rate increases and consistency across models. revision: yes

  2. Referee: [Hypothesis and discussion sections] Cognitive Overload hypothesis: the proposed mechanism (effort to decipher degraded inputs diverts resources from safety auditing) is stated without direct evidence such as attention maps, token-level probing, or transcription ablations; this leaves the alternative explanation of vision-encoder distribution shift unaddressed and untested.

    Authors: Our Cognitive Overload hypothesis is supported by the persistence of elevated jailbreak rates across diverse perturbations (resolution, noise, geometric distortion) even when text remains human-legible. We acknowledge the absence of direct mechanistic probes such as attention maps. In revision we have expanded the discussion section to explicitly contrast the overload account against vision-encoder distribution shift, using the experimental design (e.g., perturbations that do not uniformly shift encoder distributions) as supporting evidence. We also added transcription-ablation results to provide further indirect support for the proposed mechanism. revision: partial

  3. Referee: [Mitigation section] Structured Cognitive Offloading mitigation: the strategy is motivated by the untested overload account and does not include experiments showing that serialization restores safety auditing (as opposed to simply restoring high-fidelity embeddings).

    Authors: We have added new experiments that compare the Structured Cognitive Offloading pipeline against both direct degraded-image input and controls that restore high-fidelity embeddings without serialization. The results indicate that the serialized transcription-then-assessment approach yields additional safety gains beyond embedding quality alone. These comparative experiments are now reported in the revised mitigation section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observation with non-derivational hypothesis

Full rationale

The paper's core contribution consists of experimental measurements showing that MLLM refusal rates drop as input image resolution, noise, or distortion increases, even when the underlying text remains human-legible. Attribution to a 'Cognitive Overload' hypothesis is explicitly labeled as such and is not derived from any equations, fitted parameters, or self-citations that would render the claim tautological. No self-definitional constructs, predictions obtained by refitting the same data, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the provided text. The work is therefore self-contained as an observational study; the result does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on standard multimodal model assumptions plus one new hypothesized mechanism without independent evidence.

axioms (1)
  • domain assumption MLLMs jointly process visual and textual tokens in a shared attention mechanism
    Invoked to support the cognitive overload explanation
invented entities (1)
  • Cognitive Overload · no independent evidence
    purpose: Explains why degraded visual input reduces safety auditing capacity
    Postulated mechanism; no direct measurement or falsifiable test provided in abstract

pith-pipeline@v0.9.0 · 5451 in / 1118 out tokens · 42976 ms · 2026-05-11T01:45:48.361960+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 8 internal anchors

  1. [4]

    Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. 2025. Safety layers in aligned large language models: The key to LLM security. In The Thirteenth International Conference on Learning Representations.

  2. [7]

    Christian Schlarmann and Matthias Hein. 2023. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops.

  3. [8]

    Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. 2024. Image hijacks: Adversarial images can control generative models at runtime. In Proceedings of the 41st International Conference on Machine Learning (ICML'24). JMLR.org.

  4. [10]

    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. 2024. JailBreakV: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. In First Conference on Language Modeling.

  5. [12]

    Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, and Kai Chen. 2024. Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction. In Proceedings of the 33rd USENIX Conference on Security Symposium (SEC '24). USENIX Association.

  6. [14]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (ICML'24). JMLR.org.

  7. [15]

    OpenAI. 2025. GPT-4.1. Large language model, version 2025-04-14.

  8. [16]

    Anthropic. 2025. Claude 4.5 Sonnet. Large language model, snapshot 2025-09-29.

  9. [17]

    Anthropic. 2025. Claude 4.5 Haiku. Large language model, snapshot 2025-10-01.

  10. [20]

    ByteDance Seed Team. 2025. Seed 1.6: Pushing the frontiers of multimodal reasoning with adaptive chain-of-thought. Technical report, ByteDance.

  11. [21]

    V Team et al. 2025. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. Preprint, arXiv:2507.01006.

  12. [22]

    Lei Bai et al. 2025. Intern-S1: A scientific multimodal foundation model. Preprint, arXiv:2508.15763.

  13. [23]

    DeepSeek-AI. 2024. DeepSeek-V3 technical report. Preprint, arXiv:2412.19437.

  14. [24]

    Kimi Team et al. 2025. Kimi K2: Open agentic intelligence. Preprint, arXiv:2507.20534.

  15. [25]

    GLM Team et al. 2025. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. Preprint, arXiv:2508.06471.

  16. [27]

    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. 2023. How robust is Google's Bard to adversarial image attacks? Preprint, arXiv:2309.11751.

  17. [28]

    Xiaowei Huang et al. 2023. A survey of safety and trustworthiness of large language models through the lens of verification and validation. Preprint, arXiv:2305.11391.

  18. [29]

    Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. 2023. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. Preprint, arXiv:2307.14539.

  19. [30]

    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. 2023. On the robustness of large multimodal models against image adversarial attacks. Preprint, arXiv:2312.03777.

  20. [31]

    Anthropic . 2025 a . https://platform.claude.com/docs/en/release-notes/overview Claude 4.5 haiku . Large language model, snapshot 2025-10-01

  21. [32]

    Anthropic . 2025 b . https://www.anthropic.com/claude-sonnet-4-5-system-card Claude 4.5 sonnet . Large language model, snapshot 2025-09-29

  22. [33]

    Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, and 156 others. 2025 a . https://arxiv.org/abs/2508.15763 Intern-s1: A scientific multimodal foundation model . Preprint, arXiv:2508.15763

  23. [34]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025 b . Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  24. [35]

    Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. 2024. Image hijacks: adversarial images can control generative models at runtime. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  25. [36]

    Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, and Minlie Huang. 2025. https://doi.org/10.1145/3746027.3754561 Jps: Jailbreak multimodal large language models with collaborative visual perturbation and textual steering . In Proceedings of the 33rd ACM International C...

  26. [37]

    Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, and Minlie Huang. 2025. Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800

  27. [38]

    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. 2023. https://arxiv.org/abs/2312.03777 On the robustness of large multimodal models against image adversarial attacks . Preprint, arXiv:2312.03777

  28. [39]

    DeepSeek-AI. 2024. https://arxiv.org/abs/2412.19437 Deepseek-v3 technical report . Preprint, arXiv:2412.19437

  29. [40]

    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. 2023. https://arxiv.org/abs/2309.11751 How robust is google's bard to adversarial image attacks? Preprint, arXiv:2309.11751

  30. [41]

    Google Gemini Team. 2025. arxiv.org Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities . arXiv preprint arXiv:2507.06261

  31. [42]

    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. https://doi.org/10.1609/aaai.v39i22.34568 Figstep: jailbreaking large vision-language models via typographic visual prompts . In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovat...

  32. [43]

    Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, Kaiwen Cai, Yanghao Zhang, Sihao Wu, Peipei Xu, Dengyu Wu, Andre Freitas, and Mustafa A. Mustafa. 2023. https://arxiv.org/abs/2305.11391 A survey of safety and trustworthiness of large language models through the lens of verification ...

  33. [44]

    Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. 2025. https://openreview.net/forum?id=kUH1yPMAn7 Safety layers in aligned large language models: The key to LLM security . In The Thirteenth International Conference on Learning Representations

  34. [45]

    Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, and Kai Chen. 2024. Making them ask and answer: jailbreaking large language models in few queries via disguise and reconstruction. In Proceedings of the 33rd USENIX Conference on Security Symposium, SEC '24, USA. USENIX Association

  35. [46]

    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. 2024. https://openreview.net/forum?id=GC4mXVfquq Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks . In First Conference on Language Modeling

  36. [47]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  37. [48]

    OpenAI . 2025. https://openai.com/index/gpt-4-1/ Gpt-4.1 . Large language model, version 2025-04-14

  38. [49]

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. https://doi.org/10.1609/aaai.v38i19.30150 Visual adversarial examples jailbreak aligned large language models . In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificia...

  39. [51]

    Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. 2023 b . https://arxiv.org/abs/2307.14539 Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models . Preprint, arXiv:2307.14539

  40. [52]

    ByteDance Seed Team. 2025. https://seed.bytedance.com/en/seed1_6 Seed 1.6: Pushing the frontiers of multimodal reasoning with adaptive chain-of-thought . Technical Report, ByteDance. Version doubao-seed-1.6-251015 released Oct 2025

  41. [53]

    GLM Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, and 152 others. 2025 a . https://arxiv.org/abs/2508.06471 Glm-4.5: Agentic, reasoning, and coding (arc) foundation models . Prepr...

  42. [54]

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, and 150 others. 2025 b . https://arxiv.org/abs/2507.20534 Kimi k2: Open agentic intelligence . Preprint, arXiv:2507.20534

  43. [55]

    V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, and 69 others. 2025 c . https://arxiv.org/abs/2507.01006 Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable r...

  44. [56]

    Haoran Wei, Yaofeng Sun, and Yukun Li. 2025. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234

  45. [57]

    Shuo Xing, Lanqing Guo, Hongyuan Hua, Seoyoung Lee, Peiran Li, Yufei Wang, Zhangyang Wang, and Zhengzhong Tu. 2025. Demystifying the visual quality paradox in multimodal large language models. arXiv preprint arXiv:2506.15645

  46. [58]

    Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Jun Sun. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.293 Defending large language models against jailbreak attacks via layer-specific editing . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5094--5109, Miami, Florida, USA. Association for Computational Linguistics

  47. [59]

    Weixiong Zheng, Peijian Zeng, YiWei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang, and Yongmei Zhou. 2025. https://doi.org/10.18653/v1/2025.acl-long.570 Jailbreaking? one step is enough! In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11623--11642, Vienna, Austria. Association...

  48. [60]

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043