Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

Aleksandra Korolova; Blossom Metevier; Bohdan Turbal; Max Springer

arxiv: 2606.15531 · v2 · pith:CTH2DDG3new · submitted 2026-06-14 · 💻 cs.LG · cs.CR

Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

Bohdan Turbal , Blossom Metevier , Max Springer , Aleksandra Korolova This is my paper

Pith reviewed 2026-06-27 04:33 UTC · model grok-4.3

classification 💻 cs.LG cs.CR

keywords adversarial attackslarge language modelsdiscrete diffusionsemantic coherenceattack success rategray-box settingsafety alignmentperplexity detection

0 comments

The pith

Greedy Coordinate Diffusion uses discrete diffusion priors to search for coherent adversarial suffixes that achieve high success rates against safety-aligned LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Greedy Coordinate Diffusion (GCD) to generate adversarial attacks on large language models by guiding suffix search with the generative priors of discrete diffusion language models. This produces suffixes with low perplexity and high semantic adherence to the original query intent, unlike gradient-based methods that create incoherent text. The approach works in a gray-box setting without direct gradient access and yields higher attack success rates while evading detection by perplexity-based and guard-model filters at higher rates than alternatives. A sympathetic reader would care because it addresses the practical limit that coherence constraints often block targeted harmful responses.

Core claim

GCD leverages the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that achieve semantic coherence and adherence, enabling efficient generation of attacks against safety-aligned models in a gray-box setting with the highest attack success rates, competitive response quality, and lower detection rates by perplexity-based and guard-model filters.

What carries the argument

Greedy Coordinate Diffusion (GCD), a search framework that applies generative priors from discrete diffusion language models to direct adversarial suffix construction.

If this is right

GCD achieves the highest attack success rate among compared methods.
Response-quality scores stay competitive with other attacks.
Constructed adversarial prompts are detected at lower rates by perplexity-based and guard-model filters.
The method operates effectively without requiring direct gradient access to the target model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Diffusion priors may supply a reusable source of semantic structure that other adversarial search algorithms could adopt.
Models trained with stronger generative diffusion objectives might become both more vulnerable and harder to defend via incoherence filters.
The same guidance mechanism could be tested as a defense layer that scores prompt coherence against a diffusion prior before generation.
Gray-box attacks of this form may scale to settings where gradient access is restricted by API design.

Load-bearing premise

The generative priors of discrete diffusion language models can guide suffix search to produce both semantic coherence and successful elicitation of the targeted response.

What would settle it

An experiment replacing the diffusion model guidance in GCD with non-generative search that shows either a sharp drop in attack success rate or a rise in detection rates while holding query intent fixed.

Figures

Figures reproduced from arXiv: 2606.15531 by Aleksandra Korolova, Blossom Metevier, Bohdan Turbal, Max Springer.

**Figure 1.** Figure 1: At each iteration, GCD 1) stochastically samples target positions from the current prompt; 2) queries the Diffusion Language Model (DLM) Dϕ for token probabilities at these indices; and 3) selects the top-K candidates. Then 4) candidate sequences are formed from these tokens, and 5) any remaining masks are filled via a 1-step diffusion projection. Lastly, 6) Mθ scores the drafts to guide the next update. c… view at source ↗

**Figure 2.** Figure 2: Convergence of GCD and GCD-no-OSD. Top row: target CE loss vs. optimization step (left) and wall-clock time (right). Bottom row: exact-match ASR vs. optimization step (left) and wall-clock time (right). Shaded bands denote ±1 standard deviation across examples (CE only). A. Code implementation Our code implementation is available at https:// github.com/princeton-peas-lab/gcd. B. Ablation studies Unless exp… view at source ↗

**Figure 3.** Figure 3: Target CE loss per optimization step for greedy versus random token selection. approximately 2.0, whereas the random variant starts from 3.0, suggesting the greedy criterion provides essential guidance from the very first step. As optimization progresses, the gap widens further — greedy selection consistently converges within 400 steps, while the random baseline fails entirely to find successful adversar… view at source ↗

**Figure 4.** Figure 4: Mean target CE loss per optimization step for GCD and GCG on the 48-example ablation benchmark. Shaded bands indicate one standard deviation across examples. B.7. Mean perplexity and execution time GCD maintains efficient attack runtimes across all evaluated models, shown in [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

read the original abstract

Adversarial attacks on large language models have limited practical impact despite extensive research. Optimization-based attacks such as Greedy Coordinate Gradient (GCG) (Zou et al., 2023) produce high-perplexity, incoherent suffixes that existing defenses easily detect (Bengio et al., 2024). Moreover, attempting to enforce coherence constraints during optimization often prevents the attack from successfully eliciting the specific targeted response, resulting in low success rates against robust models. Conversely, attacks that maintain coherence often alter the semantic intent of queries; when the model complies with these altered queries, responses fail to address the adversary's original goal. In this work, we introduce Greedy Coordinate Diffusion (GCD), a novel framework that efficiently generates adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. GCD leverages the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that achieve semantic coherence and adherence. Unlike GCG, GCD does not require direct gradient access, allowing it to operate in a gray-box setting. We show GCD achieves highest ASR while remaining competitive on response-quality scores, and that the constructed adversarial prompts are detected at lower rates than other methods by perplexity-based and guard-model filters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GCD uses diffusion priors to guide coordinate search for more coherent adversarial suffixes in a gray-box setting, but the abstract gives no datasets, baselines, or numbers to check the ASR and detection claims.

read the letter

The core idea is to replace direct gradient-based optimization with guidance from a discrete diffusion language model so that the generated suffixes stay semantically close to the original query and low-perplexity. This is meant to avoid the usual trade-off where adding coherence constraints kills attack success.

The approach is new in how it combines the two: greedy coordinate updates steered by diffusion rather than by gradients or explicit coherence penalties. If the full experiments hold up, it could be a practical alternative when white-box access is unavailable.

The abstract states highest ASR, competitive response quality, and lower detection rates by both perplexity filters and guard models. Those are the claims that matter for the subfield. What is missing is any description of the datasets, the exact baselines beyond GCG, the number of trials, or error bars. Without those, the performance numbers cannot be evaluated.

The mechanism section is also thin in the abstract: it is not clear how the diffusion prior steers the search without re-introducing the same coherence-versus-elicitation tension the authors identify in prior work. That gap needs concrete equations or pseudocode to assess.

This is for people who run or evaluate adversarial attacks on aligned LLMs. A reader already working on gray-box or black-box methods might want to see the full experimental section. The paper deserves a serious referee to check whether the reported gains are real and reproducible rather than an artifact of the evaluation setup.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Greedy Coordinate Diffusion (GCD), a novel framework for generating adversarial suffixes against safety-aligned LLMs. It leverages generative priors from discrete diffusion language models to guide suffix search, achieving semantic coherence and adherence to the original query intent without direct gradient access (gray-box setting). The central claims are that GCD attains the highest attack success rate (ASR) while remaining competitive on response-quality metrics and producing prompts detected at lower rates than baselines by perplexity-based and guard-model filters, addressing the coherence-vs-elicitation trade-off that limits prior methods such as GCG.

Significance. If the empirical results are substantiated with proper controls, the work would be significant for adversarial ML on LLMs: it offers a mechanism to decouple coherence enforcement from attack failure using diffusion priors, potentially informing both stronger attacks and new detection strategies. The gray-box applicability and avoidance of high-perplexity suffixes are practically relevant strengths.

major comments (1)

[Abstract] Abstract: the claims of 'highest ASR', 'competitive on response-quality scores', and 'detected at lower rates' are asserted without any accompanying experimental details, datasets, baselines, tables, or error bars. This prevents verification of the central empirical contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the claims of 'highest ASR', 'competitive on response-quality scores', and 'detected at lower rates' are asserted without any accompanying experimental details, datasets, baselines, tables, or error bars. This prevents verification of the central empirical contribution.

Authors: Abstracts are written to be concise high-level summaries; the detailed experimental protocol (AdvBench and related datasets, GCG and other baselines, full tables with ASR, response-quality metrics, detection rates by perplexity/guard models, and error bars or statistical reporting) appears in Sections 4 and 5 together with the associated figures and tables. The abstract claims are therefore directly substantiated by the body of the paper, which follows conventional academic structure. We do not believe the abstract itself requires expansion. revision: no

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and available description introduce GCD as a new framework that uses external discrete diffusion language model priors to guide suffix search, without any equations, parameter fits, or derivations shown. No self-citations appear as load-bearing premises, no uniqueness theorems are invoked, and no predictions reduce to fitted inputs by construction. The method is presented as an empirical alternative to GCG with claimed advantages in ASR and detection rates; these rest on external benchmarks rather than internal redefinition. The derivation chain is therefore self-contained against the supplied text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5771 in / 1009 out tokens · 34515 ms · 2026-06-27T04:33:02.551234+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 15 linked inside Pith

[1]

arXiv preprint arXiv:2307.15043 , year=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv
[2]

arXiv preprint arXiv:2504.10694 , year=

The Jailbreak Tax: How Useful are Your Jailbreak Outputs? , author=. arXiv preprint arXiv:2504.10694 , year=

arXiv
[3]

arXiv preprint arXiv:2508.15487 , year=

Dream 7B: Diffusion Large Language Models , author=. arXiv preprint arXiv:2508.15487 , year=

Pith/arXiv arXiv
[4]

arXiv preprint arXiv:2310.08419 , year=

Jailbreaking Black Box Large Language Models in Twenty Queries , author=. arXiv preprint arXiv:2310.08419 , year=

Pith/arXiv arXiv
[5]

Tree of Attacks: Jailbreaking Black-Box

Mehrotra, Anay and Zampetakis, Manolis and Kassianik, Paul and Nelson, Blaine and Anderson, Hyrum and Singer, Yaron and Karbasi, Amin , journal=. Tree of Attacks: Jailbreaking Black-Box
[6]

arXiv preprint arXiv:2309.10253 , year=

Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts , author=. arXiv preprint arXiv:2309.10253 , year=

Pith/arXiv arXiv
[7]

Advances in Neural Information Processing Systems , year=

Structured Denoising Diffusion Models in Discrete State-Spaces , author=. Advances in Neural Information Processing Systems , year=
[8]

Advances in neural information processing systems , volume=

Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=
[9]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=
[10]

arXiv preprint arXiv:2309.00614 , year=

Baseline defenses for adversarial attacks against aligned language models , author=. arXiv preprint arXiv:2309.00614 , year=

Pith/arXiv arXiv
[11]

International Conference on Learning Representations , year=

Safety alignment should be made more than just a few tokens deep , author=. International Conference on Learning Representations , year=
[12]

2025 , month = apr, howpublished =

Preparedness Framework , author =. 2025 , month = apr, howpublished =

2025
[13]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
[14]

Lmsys-chat-1m: A large-scale real-world

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Li, Tianle and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Li, Zhuohan and Lin, Zi and Xing, Eric P and others , journal=. Lmsys-chat-1m: A large-scale real-world
[15]

arXiv preprint arXiv:2505.24445 , year=

Learning Safety Constraints for Large Language Models , author=. arXiv preprint arXiv:2505.24445 , year=

arXiv
[16]

arXiv preprint arXiv:2502.17420 , year=

The geometry of refusal in large language models: Concept cones and representational independence , author=. arXiv preprint arXiv:2502.17420 , year=

arXiv
[17]

arXiv preprint arXiv:2412.16339 , year=

Deliberative alignment: Reasoning enables safer language models , author=. arXiv preprint arXiv:2412.16339 , year=

arXiv
[18]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Inferaligner: Inference-time alignment for harmlessness through cross-model guidance , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[19]

arXiv preprint arXiv:2404.14082 , year=

Mechanistic interpretability for AI safety--a review , author=. arXiv preprint arXiv:2404.14082 , year=

Pith/arXiv arXiv
[20]

Who’s harry potter? approximate unlearning in

Eldan, Ronen and Russinovich, Mark , journal=. Who’s harry potter? approximate unlearning in
[21]

arXiv preprint arXiv:2404.11045 , year=

Offset unlearning for large language models , author=. arXiv preprint arXiv:2404.11045 , year=

arXiv
[22]

Jailbreakradar: Comprehensive assessment of jailbreak attacks against

Chu, Junjie and Liu, Yugeng and Yang, Ziqing and Shen, Xinyue and Backes, Michael and Zhang, Yang , booktitle=. Jailbreakradar: Comprehensive assessment of jailbreak attacks against
[23]

Proceedings of the 2025 Annual Network and Distributed System Security Symposium (NDSS) , year=

Safety misalignment against large language models , author=. Proceedings of the 2025 Annual Network and Distributed System Security Symposium (NDSS) , year=

2025
[24]

Llama guard:

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and others , journal=. Llama guard:
[25]

The Llama 3 Family of Models , howpublished =
[26]

arXiv preprint arXiv:2510.14276 , year=

Qwen3guard technical report , author=. arXiv preprint arXiv:2510.14276 , year=

Pith/arXiv arXiv
[27]

Cold-attack: Jailbreaking

Guo, Xingang and Yu, Fangxu and Zhang, Huan and Qin, Lianhui and Hu, Bin , journal=. Cold-attack: Jailbreaking
[28]

Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak

Liu, Xiaogeng and Li, Peiran and Suh, Edward and Vorobeychik, Yevgeniy and Mao, Zhuoqing and Jha, Somesh and McDaniel, Patrick and Sun, Huan and Li, Bo and Xiao, Chaowei , journal=. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak
[29]

arXiv preprint arXiv:2502.01633 , year=

Adversarial reasoning at jailbreaking time , author=. arXiv preprint arXiv:2502.01633 , year=

arXiv
[30]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019
[31]

Advances in neural information processing systems , volume=

Xlnet: Generalized autoregressive pretraining for language understanding , author=. Advances in neural information processing systems , volume=
[32]

arXiv preprint arXiv:2210.08933 , year=

Diffuseq: Sequence to sequence text generation with diffusion models , author=. arXiv preprint arXiv:2210.08933 , year=

Pith/arXiv arXiv
[33]

Advances in neural information processing systems , volume=

Diffusion-lm improves controllable text generation , author=. Advances in neural information processing systems , volume=
[34]

arXiv preprint arXiv:2302.10025 , year=

Dinoiser: Diffused conditional sequence learning by manipulating noises , author=. arXiv preprint arXiv:2302.10025 , year=

arXiv
[35]

Proceedings of the 41st International Conference on Machine Learning , pages =

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , publisher =

2024
[36]

Advances in Neural Information Processing Systems , volume=

Simple and effective masked diffusion language models , author=. Advances in Neural Information Processing Systems , volume=
[37]

arXiv preprint arXiv:2502.09992 , year=

Large language diffusion models , author=. arXiv preprint arXiv:2502.09992 , year=

Pith/arXiv arXiv
[38]

Advances in Neural Information Processing Systems , volume=

A strongreject for empty jailbreaks , author=. Advances in Neural Information Processing Systems , volume=
[39]

arXiv preprint arXiv:2501.18837 , year=

Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming , author=. arXiv preprint arXiv:2501.18837 , year=

Pith/arXiv arXiv
[40]

arXiv preprint arXiv:2601.04603 , year=

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks , author=. arXiv preprint arXiv:2601.04603 , year=

arXiv
[41]

arXiv preprint arXiv:2412.05282 , year=

International scientific report on the safety of advanced ai (interim report) , author=. arXiv preprint arXiv:2412.05282 , year=

arXiv
[42]

1990 , publisher=

Human error , author=. 1990 , publisher=

1990
[43]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Maskgit: Masked generative image transformer , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[44]

arXiv preprint arXiv:2502.06768 , year=

Train for the worst, plan for the best: Understanding token ordering in masked diffusions , author=. arXiv preprint arXiv:2502.06768 , year=

arXiv
[45]

Diffusion

L. Diffusion. arXiv preprint arXiv:2511.00203 , year=

arXiv
[46]

Diffusionattacker: Diffusion-driven prompt manipulation for

Wang, Hao and Li, Hao and Zhu, Junda and Wang, Xinyuan and Pan, Chengwei and Huang, MinLie and Sha, Lei , booktitle=. Diffusionattacker: Diffusion-driven prompt manipulation for
[47]

arXiv preprint arXiv:2509.03888 , year=

False sense of security: Why probing-based malicious input detection fails to generalize , author=. arXiv preprint arXiv:2509.03888 , year=

arXiv
[48]

arXiv preprint arXiv:2311.11509 , year=

Token-level adversarial prompt detection based on perplexity measures and contextual information , author=. arXiv preprint arXiv:2311.11509 , year=

arXiv
[49]

arXiv preprint arXiv:2405.21018 , year=

Improved techniques for optimization-based jailbreaking on large language models , author=. arXiv preprint arXiv:2405.21018 , year=

arXiv
[50]

Don't say no: Jailbreaking

Zhou, Yukai and Lou, Jian and Huang, Zhijie and Qin, Zhan and Yang, Sibei and Wang, Wenjie , booktitle=. Don't say no: Jailbreaking
[51]

arXiv preprint arXiv:2104.13733 , year=

Gradient-based adversarial attacks against text transformers , author=. arXiv preprint arXiv:2104.13733 , year=

arXiv
[52]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv
[53]

5-coder technical report , author=

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

Pith/arXiv arXiv
[54]

Jiang, Dongsheng and Liu, Yuchen and Liu, Songlin and Zhao, Jin'e and Zhang, Hao and Gao, Zhen and Zhang, Xiaopeng and Li, Jin and Xiong, Hongkai , journal=. From
[55]

Advances in Neural Information Processing Systems , volume=

Jailbreakbench: An open robustness benchmark for jailbreaking large language models , author=. Advances in Neural Information Processing Systems , volume=
[56]

Towards understanding jailbreak attacks in

Lin, Yuping and He, Pengfei and Xu, Han and Xing, Yue and Yamada, Makoto and Liu, Hui and Tang, Jiliang , journal=. Towards understanding jailbreak attacks in
[57]

arXiv preprint arXiv:2310.04451 , year=

Autodan: Generating stealthy jailbreak prompts on aligned large language models , author=. arXiv preprint arXiv:2310.04451 , year=

Pith/arXiv arXiv
[58]

Foundations and Trends

Introduction to multi-armed bandits , author=. Foundations and Trends. 2019 , publisher=

2019
[59]

arXiv preprint arXiv:1904.09751 , year=

The curious case of neural text degeneration , author=. arXiv preprint arXiv:1904.09751 , year=

Pith/arXiv arXiv 1904
[60]

Advprefix: An objective for nuanced

Zhu, Sicheng and Amos, Brandon and Tian, Yuandong and Guo, Chuan and Evtimov, Ivan , journal=. Advprefix: An objective for nuanced
[61]

The journal of the Acoustical Society of America , volume=

Perplexity—a measure of the difficulty of speech recognition tasks , author=. The journal of the Acoustical Society of America , volume=. 1977 , publisher=

1977
[62]

Computer Speech & Language , volume=

An empirical study of smoothing techniques for language modeling , author=. Computer Speech & Language , volume=. 1999 , publisher=

1999
[63]

Proceedings of the twelfth language resources and evaluation conference , pages=

Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm. Proceedings of the twelfth language resources and evaluation conference , pages=
[64]

Improved generation of adversarial examples against safety-aligned

Li, Qizhang and Guo, Yiwen and Zuo, Wangmeng and Chen, Hao , journal=. Improved generation of adversarial examples against safety-aligned
[65]

2023 , eprint=

Mistral 7B , author=. 2023 , eprint=

2023
[66]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025
[67]

arXiv preprint arXiv:2402.04249 , year=

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal , author=. arXiv preprint arXiv:2402.04249 , year=

Pith/arXiv arXiv
[68]

2023 , eprint=

Detecting Language Model Attacks with Perplexity , author=. 2023 , eprint=

2023

[1] [1]

arXiv preprint arXiv:2307.15043 , year=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:2504.10694 , year=

The Jailbreak Tax: How Useful are Your Jailbreak Outputs? , author=. arXiv preprint arXiv:2504.10694 , year=

arXiv

[3] [3]

arXiv preprint arXiv:2508.15487 , year=

Dream 7B: Diffusion Large Language Models , author=. arXiv preprint arXiv:2508.15487 , year=

Pith/arXiv arXiv

[4] [4]

arXiv preprint arXiv:2310.08419 , year=

Jailbreaking Black Box Large Language Models in Twenty Queries , author=. arXiv preprint arXiv:2310.08419 , year=

Pith/arXiv arXiv

[5] [5]

Tree of Attacks: Jailbreaking Black-Box

Mehrotra, Anay and Zampetakis, Manolis and Kassianik, Paul and Nelson, Blaine and Anderson, Hyrum and Singer, Yaron and Karbasi, Amin , journal=. Tree of Attacks: Jailbreaking Black-Box

[6] [6]

arXiv preprint arXiv:2309.10253 , year=

Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts , author=. arXiv preprint arXiv:2309.10253 , year=

Pith/arXiv arXiv

[7] [7]

Advances in Neural Information Processing Systems , year=

Structured Denoising Diffusion Models in Discrete State-Spaces , author=. Advances in Neural Information Processing Systems , year=

[8] [8]

Advances in neural information processing systems , volume=

Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=

[9] [9]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

[10] [10]

arXiv preprint arXiv:2309.00614 , year=

Baseline defenses for adversarial attacks against aligned language models , author=. arXiv preprint arXiv:2309.00614 , year=

Pith/arXiv arXiv

[11] [11]

International Conference on Learning Representations , year=

Safety alignment should be made more than just a few tokens deep , author=. International Conference on Learning Representations , year=

[12] [12]

2025 , month = apr, howpublished =

Preparedness Framework , author =. 2025 , month = apr, howpublished =

2025

[13] [13]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

[14] [14]

Lmsys-chat-1m: A large-scale real-world

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Li, Tianle and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Li, Zhuohan and Lin, Zi and Xing, Eric P and others , journal=. Lmsys-chat-1m: A large-scale real-world

[15] [15]

arXiv preprint arXiv:2505.24445 , year=

Learning Safety Constraints for Large Language Models , author=. arXiv preprint arXiv:2505.24445 , year=

arXiv

[16] [16]

arXiv preprint arXiv:2502.17420 , year=

The geometry of refusal in large language models: Concept cones and representational independence , author=. arXiv preprint arXiv:2502.17420 , year=

arXiv

[17] [17]

arXiv preprint arXiv:2412.16339 , year=

Deliberative alignment: Reasoning enables safer language models , author=. arXiv preprint arXiv:2412.16339 , year=

arXiv

[18] [18]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Inferaligner: Inference-time alignment for harmlessness through cross-model guidance , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024

[19] [19]

arXiv preprint arXiv:2404.14082 , year=

Mechanistic interpretability for AI safety--a review , author=. arXiv preprint arXiv:2404.14082 , year=

Pith/arXiv arXiv

[20] [20]

Who’s harry potter? approximate unlearning in

Eldan, Ronen and Russinovich, Mark , journal=. Who’s harry potter? approximate unlearning in

[21] [21]

arXiv preprint arXiv:2404.11045 , year=

Offset unlearning for large language models , author=. arXiv preprint arXiv:2404.11045 , year=

arXiv

[22] [22]

Jailbreakradar: Comprehensive assessment of jailbreak attacks against

Chu, Junjie and Liu, Yugeng and Yang, Ziqing and Shen, Xinyue and Backes, Michael and Zhang, Yang , booktitle=. Jailbreakradar: Comprehensive assessment of jailbreak attacks against

[23] [23]

Proceedings of the 2025 Annual Network and Distributed System Security Symposium (NDSS) , year=

Safety misalignment against large language models , author=. Proceedings of the 2025 Annual Network and Distributed System Security Symposium (NDSS) , year=

2025

[24] [24]

Llama guard:

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and others , journal=. Llama guard:

[25] [25]

The Llama 3 Family of Models , howpublished =

[26] [26]

arXiv preprint arXiv:2510.14276 , year=

Qwen3guard technical report , author=. arXiv preprint arXiv:2510.14276 , year=

Pith/arXiv arXiv

[27] [27]

Cold-attack: Jailbreaking

Guo, Xingang and Yu, Fangxu and Zhang, Huan and Qin, Lianhui and Hu, Bin , journal=. Cold-attack: Jailbreaking

[28] [28]

Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak

Liu, Xiaogeng and Li, Peiran and Suh, Edward and Vorobeychik, Yevgeniy and Mao, Zhuoqing and Jha, Somesh and McDaniel, Patrick and Sun, Huan and Li, Bo and Xiao, Chaowei , journal=. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak

[29] [29]

arXiv preprint arXiv:2502.01633 , year=

Adversarial reasoning at jailbreaking time , author=. arXiv preprint arXiv:2502.01633 , year=

arXiv

[30] [30]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019

[31] [31]

Advances in neural information processing systems , volume=

Xlnet: Generalized autoregressive pretraining for language understanding , author=. Advances in neural information processing systems , volume=

[32] [32]

arXiv preprint arXiv:2210.08933 , year=

Diffuseq: Sequence to sequence text generation with diffusion models , author=. arXiv preprint arXiv:2210.08933 , year=

Pith/arXiv arXiv

[33] [33]

Advances in neural information processing systems , volume=

Diffusion-lm improves controllable text generation , author=. Advances in neural information processing systems , volume=

[34] [34]

arXiv preprint arXiv:2302.10025 , year=

Dinoiser: Diffused conditional sequence learning by manipulating noises , author=. arXiv preprint arXiv:2302.10025 , year=

arXiv

[35] [35]

Proceedings of the 41st International Conference on Machine Learning , pages =

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , publisher =

2024

[36] [36]

Advances in Neural Information Processing Systems , volume=

Simple and effective masked diffusion language models , author=. Advances in Neural Information Processing Systems , volume=

[37] [37]

arXiv preprint arXiv:2502.09992 , year=

Large language diffusion models , author=. arXiv preprint arXiv:2502.09992 , year=

Pith/arXiv arXiv

[38] [38]

Advances in Neural Information Processing Systems , volume=

A strongreject for empty jailbreaks , author=. Advances in Neural Information Processing Systems , volume=

[39] [39]

arXiv preprint arXiv:2501.18837 , year=

Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming , author=. arXiv preprint arXiv:2501.18837 , year=

Pith/arXiv arXiv

[40] [40]

arXiv preprint arXiv:2601.04603 , year=

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks , author=. arXiv preprint arXiv:2601.04603 , year=

arXiv

[41] [41]

arXiv preprint arXiv:2412.05282 , year=

International scientific report on the safety of advanced ai (interim report) , author=. arXiv preprint arXiv:2412.05282 , year=

arXiv

[42] [42]

1990 , publisher=

Human error , author=. 1990 , publisher=

1990

[43] [43]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Maskgit: Masked generative image transformer , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[44] [44]

arXiv preprint arXiv:2502.06768 , year=

Train for the worst, plan for the best: Understanding token ordering in masked diffusions , author=. arXiv preprint arXiv:2502.06768 , year=

arXiv

[45] [45]

Diffusion

L. Diffusion. arXiv preprint arXiv:2511.00203 , year=

arXiv

[46] [46]

Diffusionattacker: Diffusion-driven prompt manipulation for

Wang, Hao and Li, Hao and Zhu, Junda and Wang, Xinyuan and Pan, Chengwei and Huang, MinLie and Sha, Lei , booktitle=. Diffusionattacker: Diffusion-driven prompt manipulation for

[47] [47]

arXiv preprint arXiv:2509.03888 , year=

False sense of security: Why probing-based malicious input detection fails to generalize , author=. arXiv preprint arXiv:2509.03888 , year=

arXiv

[48] [48]

arXiv preprint arXiv:2311.11509 , year=

Token-level adversarial prompt detection based on perplexity measures and contextual information , author=. arXiv preprint arXiv:2311.11509 , year=

arXiv

[49] [49]

arXiv preprint arXiv:2405.21018 , year=

Improved techniques for optimization-based jailbreaking on large language models , author=. arXiv preprint arXiv:2405.21018 , year=

arXiv

[50] [50]

Don't say no: Jailbreaking

Zhou, Yukai and Lou, Jian and Huang, Zhijie and Qin, Zhan and Yang, Sibei and Wang, Wenjie , booktitle=. Don't say no: Jailbreaking

[51] [51]

arXiv preprint arXiv:2104.13733 , year=

Gradient-based adversarial attacks against text transformers , author=. arXiv preprint arXiv:2104.13733 , year=

arXiv

[52] [52]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[53] [53]

5-coder technical report , author=

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

Pith/arXiv arXiv

[54] [54]

Jiang, Dongsheng and Liu, Yuchen and Liu, Songlin and Zhao, Jin'e and Zhang, Hao and Gao, Zhen and Zhang, Xiaopeng and Li, Jin and Xiong, Hongkai , journal=. From

[55] [55]

Advances in Neural Information Processing Systems , volume=

Jailbreakbench: An open robustness benchmark for jailbreaking large language models , author=. Advances in Neural Information Processing Systems , volume=

[56] [56]

Towards understanding jailbreak attacks in

Lin, Yuping and He, Pengfei and Xu, Han and Xing, Yue and Yamada, Makoto and Liu, Hui and Tang, Jiliang , journal=. Towards understanding jailbreak attacks in

[57] [57]

arXiv preprint arXiv:2310.04451 , year=

Autodan: Generating stealthy jailbreak prompts on aligned large language models , author=. arXiv preprint arXiv:2310.04451 , year=

Pith/arXiv arXiv

[58] [58]

Foundations and Trends

Introduction to multi-armed bandits , author=. Foundations and Trends. 2019 , publisher=

2019

[59] [59]

arXiv preprint arXiv:1904.09751 , year=

The curious case of neural text degeneration , author=. arXiv preprint arXiv:1904.09751 , year=

Pith/arXiv arXiv 1904

[60] [60]

Advprefix: An objective for nuanced

Zhu, Sicheng and Amos, Brandon and Tian, Yuandong and Guo, Chuan and Evtimov, Ivan , journal=. Advprefix: An objective for nuanced

[61] [61]

The journal of the Acoustical Society of America , volume=

Perplexity—a measure of the difficulty of speech recognition tasks , author=. The journal of the Acoustical Society of America , volume=. 1977 , publisher=

1977

[62] [62]

Computer Speech & Language , volume=

An empirical study of smoothing techniques for language modeling , author=. Computer Speech & Language , volume=. 1999 , publisher=

1999

[63] [63]

Proceedings of the twelfth language resources and evaluation conference , pages=

Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm. Proceedings of the twelfth language resources and evaluation conference , pages=

[64] [64]

Improved generation of adversarial examples against safety-aligned

Li, Qizhang and Guo, Yiwen and Zuo, Wangmeng and Chen, Hao , journal=. Improved generation of adversarial examples against safety-aligned

[65] [65]

2023 , eprint=

Mistral 7B , author=. 2023 , eprint=

2023

[66] [66]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025

[67] [67]

arXiv preprint arXiv:2402.04249 , year=

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal , author=. arXiv preprint arXiv:2402.04249 , year=

Pith/arXiv arXiv

[68] [68]

2023 , eprint=

Detecting Language Model Attacks with Perplexity , author=. 2023 , eprint=

2023