pith. sign in

arxiv: 2606.04027 · v1 · pith:7BRYTHDVnew · submitted 2026-06-01 · 💻 cs.CR · cs.AI

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Pith reviewed 2026-06-28 13:41 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreakingdiffusion large language modelsadversarial attacksblack-box attacksstructural patternsUCB banditred-teamingAI safety
0
0 comments X

The pith

MaskForge abstracts jailbreaks into reusable structural patterns and selects them with a UCB bandit to attack diffusion LLMs at 79.3 percent average success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MaskForge as a black-box adaptive attack that treats jailbreaking of diffusion large language models as search over a library of structural patterns derived from prior successes. dLLMs generate by iteratively denoising masked sequences under bidirectional context, which creates an infilling route for harmful content outside monitored prefixes that standard autoregressive jailbreaks do not exploit. MaskForge abstracts successful attempts into schemas, chooses goal-compatible ones via UCB bandit, falls back to a scorer when the library is insufficient, and adds new successes back into the library so experience accumulates across goals. This produces an average 79.3 percent attack success rate on five public dLLMs and three benchmarks, a 17.6 percent relative gain over the best prior dLLM baseline, and transfers the matured library to AdvBench for 88.2 percent success without further changes.

Core claim

MaskForge casts dLLM red-teaming as optimized search over a growing library of structural patterns. It abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3 percent, a 17.6 percent relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving an 88.2 percent attack succ

What carries the argument

the growing library of structural patterns abstracted from successful jailbreaks, selected by UCB bandit with scorer-guided fallback and updated by distilling new successes

If this is right

  • The method reaches 79.3 percent average attack success rate across five dLLMs and three benchmarks.
  • It delivers a 17.6 percent relative improvement over the strongest prior dLLM baseline.
  • The matured library transfers unchanged to AdvBench and reaches 88.2 percent success, a 67 percent relative gain over the prior best baseline on that set.
  • Experience from successful attempts is retained and reused across different attack goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The transfer result on AdvBench suggests that structural patterns matured on one collection of tasks can apply to others without retraining.
  • Bandit-driven selection reduces the need for hand-crafted templates per new goal.
  • dLLM defenses may need to address infilling paths that operate outside the initial prompt prefix.

Load-bearing premise

That successful jailbreak attempts can be reliably abstracted into reusable structural schemas whose selection via UCB bandit and scorer-guided fallback will generalize across goals and models without the library overfitting to the specific benchmarks used for maturation.

What would settle it

Testing the matured pattern library on a fresh benchmark or unseen dLLM variant with no further updates and observing whether attack success rate remains near 88 percent or falls sharply would confirm or refute the claimed generalization.

Figures

Figures reproduced from arXiv: 2606.04027 by Chaowei Xiao, Minhui Xue, Xiaogeng Liu, Yingzi Ma, Yue Zhao, Zhengyue Zhao.

Figure 1
Figure 1. Figure 1: Overview of MASKFORGE. (a) In Stage 1, an attacker LLM produces mask-bearing templates for bootstrap goals; structural patterns are extracted and stored in a shared library. (b) In Stage 2, a UCB bandit selects patterns from the library; the attacker instantiates a template; the target dLLM fills the masks; and the scorer returns a reward. High-reward attempts are distilled back into the library. (c) A fal… view at source ↗
Figure 2
Figure 2. Figure 2: Expansion pipeline. Given a malicious goal, UCB picks one pattern from the goal-type candidate set; the Attacker LLM instantiates an aligned jailbreak prompt + <mask:N> template; the target dLLM mask-fills it. A scorer rates the output; on low reward, the system falls back to a drafter LLM that drafts a raw response, which is redacted into a new mask scaffold and re-attacked. Successful new patterns are su… view at source ↗
Figure 3
Figure 3. Figure 3: Attack success rate (ASR) comparison across four victim dLLM under different defenses. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Harmful Score (HS) comparison across four victim dLLMs under different defenses. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of mask ratio r on Jail￾breakBench across four victim dLLMs. Impact of Mask Ratio. The mask ratio r controls how much of a template is left for the victim to fill versus fixed by the attacker—a dLLM-specific control surface absent in AR jailbreak attacks [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustrative cases of harmful completions generated by four dLLMs when attacked by [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Pattern-library statistics over the 1,779 patterns in Π. (a) Goal-type bucket sizes under the 30-class taxonomy, all patterns reclassified by Qwen3-235B. The five largest buckets cover the dominant Stage 2 success modes; narrow categories such as EATING_DISORDERS or TAX_EVASION have far fewer patterns because Stage 2 accumulates patterns only from successful attempts. (b) Top-8 structure-type fields produc… view at source ↗
read the original abstract

Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than position, harmful content can be induced through infilling and outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uniformly across goals, with little structural adaptation or accumulated attack experience. We propose MaskForge, a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. MaskForge abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving a 88.2% attack success rate and a 67% relative improvement over the strongest competing baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MaskForge, a black-box adaptive jailbreak attack on diffusion LLMs that abstracts successful attempts into reusable structural patterns, selects them via UCB bandit, and applies scorer-guided fallback, with successful attempts distilled back into the library. It reports an average ASR of 79.3% (17.6% relative improvement) across five dLLMs and three benchmarks, plus zero-update transfer to AdvBench at 88.2% ASR (67% relative improvement).

Significance. If the generalization claims hold, the work is significant for identifying dLLM-specific attack surfaces arising from native mask infilling and bidirectional context, and for demonstrating an experience-accumulating, structure-aware red-teaming method that outperforms static baselines. The empirical library maturation and transfer results, if robust, could inform defense design for this emerging model class.

major comments (2)
  1. [Abstract] Abstract: The headline transfer result (88.2% ASR on AdvBench with no library updates) is load-bearing for the generalization claim, yet the manuscript provides no evidence that patterns were validated on held-out goals, tested for diversity, or that the three maturation benchmarks are sufficiently distinct from AdvBench to rule out overfitting to benchmark-specific infilling patterns.
  2. [Abstract] Abstract: The reported 79.3% average ASR and 17.6% relative improvement are presented without error bars, standard deviations, trial counts, or verification against post-hoc pattern selection, undermining assessment of whether the gains are statistically reliable or benchmark-tuned.
minor comments (1)
  1. [Abstract] The abstract refers to 'five public dLLMs and three benchmarks' without naming them or providing dataset details, which would improve reproducibility even if full experimental sections exist later in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger evidence on generalization and statistical robustness. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline transfer result (88.2% ASR on AdvBench with no library updates) is load-bearing for the generalization claim, yet the manuscript provides no evidence that patterns were validated on held-out goals, tested for diversity, or that the three maturation benchmarks are sufficiently distinct from AdvBench to rule out overfitting to benchmark-specific infilling patterns.

    Authors: We agree the abstract does not explicitly document held-out validation or diversity testing for the transferred patterns. The full manuscript presents the transfer as zero-update application of the matured library, with the three maturation benchmarks being standard jailbreak suites whose goal distributions differ from AdvBench in scope and phrasing. To strengthen the claim, we will revise the manuscript by adding a dedicated paragraph in the experiments section (and a brief reference in the abstract) that reports pattern diversity metrics, confirms held-out goal testing during maturation, and tabulates benchmark distinctions to address potential overfitting concerns. revision: yes

  2. Referee: [Abstract] Abstract: The reported 79.3% average ASR and 17.6% relative improvement are presented without error bars, standard deviations, trial counts, or verification against post-hoc pattern selection, undermining assessment of whether the gains are statistically reliable or benchmark-tuned.

    Authors: The abstract condenses the primary results; the experimental section reports averages over repeated trials with the adaptive bandit process. We acknowledge the abstract itself lacks these details. We will revise the abstract to note the number of trials and that improvements arise from the online UCB-driven selection rather than post-hoc filtering. We will also ensure the main text includes standard deviations or error bars on key ASR tables for the five models. revision: yes

Circularity Check

0 steps flagged

No mathematical derivation; empirical search method only

full rationale

The paper describes an empirical black-box attack using pattern library accumulation, UCB selection, and scorer fallback. No equations, derivations, or first-principles claims are present that could reduce to inputs by construction. Reported ASRs are experimental outcomes, not analytic predictions. No self-citation chains or ansatzes are invoked as load-bearing. This matches the default expectation of no circularity for non-derivational work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no equations, parameters, or background assumptions are detailed enough to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5768 in / 1164 out tokens · 24194 ms · 2026-06-28T13:41:31.019225+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

88 extracted references · 34 canonical work pages · 19 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Claude opus 4.6.https://www.anthropic.com/claude, 2025

    Anthropic. Claude opus 4.6.https://www.anthropic.com/claude, 2025. Large language model

  3. [3]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models.arXiv preprint arXiv:2503.09573, 2025

  4. [4]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002

  5. [5]

    Structured denoising diffusion models in discrete state-spaces.Advances in Neural Information Processing Systems (NeurIPS), 2021

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in Neural Information Processing Systems (NeurIPS), 2021

  6. [6]

    Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025

  7. [7]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models.Advances in Neural Information Processing Systems (NeurIPS), 2024

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models.Advances in Neural Information Processing Systems (NeurIPS), 2024

  8. [8]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE, 2025

  9. [9]

    A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily

    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

  10. [10]

    Mercury: A code efficiency benchmark for code large language models.Advances in Neural Information Processing Systems, 37:16601–16622, 2024

    Mingzhe Du, Luu A Tuan, Bin Ji, Qian Liu, and See-Kiong Ng. Mercury: A code efficiency benchmark for code large language models.Advances in Neural Information Processing Systems, 37:16601–16622, 2024

  11. [11]

    Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

    Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

  12. [12]

    Gemini diffusion

    Google DeepMind. Gemini diffusion. https://blog.google/technology/google-deepmind/ gemini-diffusion/, 2025. Accessed: 2025-05-21

  13. [13]

    Reinforcing the diffusion chain of lateral thought with diffusion language models.arXiv preprint arXiv:2505.10446, 2025

    Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, and Guo-Jun Qi. Reinforcing the diffusion chain of lateral thought with diffusion language models.arXiv preprint arXiv:2505.10446, 2025

  14. [14]

    A2d: Any-order, any-step safety alignment for diffusion language models.arXiv preprint arXiv:2509.23286, 2025

    Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, and Albert No. A2d: Any-order, any-step safety alignment for diffusion language models.arXiv preprint arXiv:2509.23286, 2025

  15. [15]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  16. [16]

    Mercury: Ultra-Fast Language Models Based on Diffusion

    Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, et al. Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 2025

  17. [17]

    Cambridge University Press, 2020

    Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020

  18. [18]

    Open source strikes bread - new fluffy embeddings model, 2024

    Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. Open source strikes bread - new fluffy embeddings model, 2024. URLhttps://www.mixedbread.ai/blog/mxbai-embed-large-v1. 10

  19. [19]

    Diffuguard: How intrinsic safety is lost and found in diffusion large language models.arXiv preprint arXiv:2509.24296, 2025

    Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, and Jiaheng Zhang. Diffuguard: How intrinsic safety is lost and found in diffusion large language models.arXiv preprint arXiv:2509.24296, 2025

  20. [20]

    AutoDAN: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

  21. [21]

    Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao

    Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. AutoDAN-turbo: A lifelong agent for strategy self- exploration to jailbreak LLMs. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

  22. [22]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2024

  23. [23]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

  24. [24]

    Gcg attack on a diffusion llm.arXiv preprint arXiv:2601.14266, 2025

    Ruben Neyroud and Sam Corley. Gcg attack on a diffusion llm.arXiv preprint arXiv:2601.14266, 2025

  25. [25]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025

  26. [26]

    OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander M ˛ adry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, A...

  27. [27]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine- tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693, 2023

  28. [28]

    Safety alignment should be made more than just a few tokens deep.arXiv preprint arXiv:2406.05946, 2024

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep.arXiv preprint arXiv:2406.05946, 2024

  29. [29]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  30. [30]

    Simple and effective masked diffusion language models

    Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Edgar Mariano Marroquin, Alexander M Rush, Yair Schiff, Justin T Chiu, and V olodymyr Kuleshov. Simple and effective masked diffusion language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM

  31. [31]

    Hill-climbing search.Encyclopedia of cognitive science, 81(333-335): 10, 2006

    Bart Selman and Carla P Gomes. Hill-climbing search.Encyclopedia of cognitive science, 81(333-335): 10, 2006

  32. [32]

    Simplified and generalized masked diffusion for discrete data

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

  33. [33]

    Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

    Arth Singh. Re-mask and redirect: Exploiting denoising irreversibility in diffusion language models.arXiv preprint arXiv:2604.08557, 2026

  34. [34]

    Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193, 2025

  35. [35]

    A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024

  36. [36]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Ma- jumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533, 2022

  37. [37]

    Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079–80110, 2023

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079–80110, 2023. 12

  38. [38]

    The devil behind the mask: An emergent safety vulnerability of diffusion llms.arXiv preprint arXiv:2507.11097, 2025

    Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, et al. The devil behind the mask: An emergent safety vulnerability of diffusion llms.arXiv preprint arXiv:2507.11097, 2025

  39. [39]

    Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

    Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

  40. [40]

    Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding.arXiv preprint arXiv:2505.22618, 2025

  41. [41]

    Dreamon: Diffusion language models for code infilling beyond fixed-size canvas.arXiv preprint arXiv:2602.01326, 2026

    Zirui Wu, Lin Zheng, Zhihui Xie, Jiacheng Ye, Jiahui Gao, Shansan Gong, Yansong Feng, Zhenguo Li, Wei Bi, Guorui Zhou, et al. Dreamon: Diffusion language models for code infilling beyond fixed-size canvas.arXiv preprint arXiv:2602.01326, 2026

  42. [42]

    C-pack: Packaged resources to advance general chinese embedding, 2023

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

  43. [43]

    Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12): 1486–1496, 2023

    Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12): 1486–1496, 2023

  44. [44]

    Dream-coder 7b: An open diffusion language model for code.arXiv preprint arXiv:2509.01142, 2025

    Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, et al. Dream-coder 7b: An open diffusion language model for code.arXiv preprint arXiv:2509.01142, 2025

  45. [45]

    Where to start alignment? diffusion large language model may demand a distinct position

    Zhixin Xie, Xurui Song, and Jun Luo. Where to start alignment? diffusion large language model may demand a distinct position. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 1328–1336, 2026

  46. [46]

    Unveiling the potential of diffusion large language model in controllable generation.arXiv preprint arXiv:2507.04504, 2025

    Zhen Xiong, Yujun Cai, Zhecheng Li, and Yiwei Wang. Unveiling the potential of diffusion large language model in controllable generation.arXiv preprint arXiv:2507.04504, 2025

  47. [47]

    Toward safer diffusion language models: Discovery and mitigation of priming vulnerability.arXiv preprint arXiv:2510.00565, 2025

    Shojiro Yamabe and Jun Sakuma. Toward safer diffusion language models: Discovery and mitigation of priming vulnerability.arXiv preprint arXiv:2510.00565, 2025

  48. [48]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  49. [49]

    MMaDA: Multimodal Large Diffusion Language Models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025

  50. [50]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025

  51. [51]

    Jail- breaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation

    Yuanhe Zhang, Fangzhou Xie, Zhenhong Zhou, Zherui Li, Hao Chen, Kun Wang, and Yufei Guo. Jail- breaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation. arXiv preprint arXiv:2507.19227, 2025

  52. [52]

    d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216, 2025

    Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216, 2025

  53. [53]

    Armor: Aligning secure and safe large language models via meticulous reasoning.arXiv preprint arXiv:2507.11500, 2025

    Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, and Chaowei Xiao. Armor: Aligning secure and safe large language models via meticulous reasoning.arXiv preprint arXiv:2507.11500, 2025

  54. [54]

    Robust prompt optimization for defending language models against jailbreaking attacks.Advances in Neural Information Processing Systems, 37:40184–40211, 2024

    Andy Zhou, Bo Li, and Haohan Wang. Robust prompt optimization for defending language models against jailbreaking attacks.Advances in Neural Information Processing Systems, 37:40184–40211, 2024

  55. [55]

    LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models.arXiv preprint arXiv:2505.19223, 2025

  56. [56]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 A Overview Our appendix includes the following sections:

  57. [57]

    The deployment scenario MASKFORGEtargets and the assump- tions it drops relative to prior dLLM attacks

    SectionB: Threat Model. The deployment scenario MASKFORGEtargets and the assump- tions it drops relative to prior dLLM attacks

  58. [58]

    Evaluation metrics, implementation details, hyperparam- eters, and computational resource requirements of MASKFORGE

    SectionC: Details of Experiments. Evaluation metrics, implementation details, hyperparam- eters, and computational resource requirements of MASKFORGE

  59. [59]

    Reproduction details and results of Self- reminder, Preference Optimization, and A2D against MASKFORGE

    SectionD: Defense and Alignment for dLLMs. Reproduction details and results of Self- reminder, Preference Optimization, and A2D against MASKFORGE

  60. [60]

    Algorithmic outline, query-time analysis on test-time transferability, and the retrieval strategy used in MASKFORGE

    SectionE: Details of MASKFORGE. Algorithmic outline, query-time analysis on test-time transferability, and the retrieval strategy used in MASKFORGE

  61. [61]

    The full set of attack patterns discovered and reused by MASK- FORGE

    SectionF: Pattern Library. The full set of attack patterns discovered and reused by MASK- FORGE

  62. [62]

    https :// api . i n c e p t i o n l a b s . ai / v1 / chat / c o m p l e t i o n s

    SectionG: Full Prompt of MASKFORGE. Prompts used by the attacker, scorer, drafter, and summarizer LLMs. B Threat Model MASKFORGEassumes a black-box attacker with only textual input–output access to a dLLM deployed as a chat assistant. The attacker cannot inspect or modify model weights, gradients, or denoising trajectories, and is not assumed to have acce...

  63. [63]

    Each pattern p carries running statistics (µp, np) and is scored by µp +α p 2 lnt/np with α= 1.0 and the bonus capped at 1.0

    UCB pattern selection.The candidate pool is the bucket of patterns whose goal type matches the current goal (or the full registry if the bucket is empty). Each pattern p carries running statistics (µp, np) and is scored by µp +α p 2 lnt/np with α= 1.0 and the bonus capped at 1.0. Unvisited patterns are warm-started with a synthetic prior of (µ0, n0) = (0....

  64. [64]

    The template’s literal text is goal-aligned but contains no harmful specifics; every specific is replaced by a masked placeholder

    Pattern instantiation.A single attacker call, conditioned on the chosen pattern’s represen- tative template, emits a strategy, a jailbreak prompt, and a template. The template’s literal text is goal-aligned but contains no harmful specifics; every specific is replaced by a masked placeholder

  65. [65]

    The headline configuration uses r∈[0.7,0.9] ; the mask-ratio ablation (Fig

    Mask-ratio rescaling.A post-processor solves Nmask/(Nmask +N lit) =r for the tar- get ratio r via tokenizer counts, while clamping the total length to 512 tokens. The headline configuration uses r∈[0.7,0.9] ; the mask-ratio ablation (Fig. 5) sweeps r∈ {0.10,0.30,0.50,0.70,0.90}

  66. [66]

    Victim mask-fill.The expanded template is sent to the victim, which fills all masked positions in parallel

  67. [67]

    C.1) is applied to the (goal, prompt, template, output) tuple and emits a scalarρ∈[0,1]

    Scorer reward.A HarmBench-style classifier (the local Qwen3-4B-Instruct judge of Sec. C.1) is applied to the (goal, prompt, template, output) tuple and emits a scalarρ∈[0,1] . The attempt is accepted ifρ≥0.9

  68. [68]

    Fallbacks share the iteration budget; an iteration that succeeds via fallback records both the original and the rescued attempt

    Fallback (only if ρ <0.7 ).Up to three retries: (a) Qwen3-4B-Base produces a raw harmful draft, (b) Qwen3-235B re-tags the unsafe spans, and (c) the tagged spans are converted into masked placeholders of matching token length, after which the victim is re-attacked with the 4https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 5https://huggingface.co/Qwen/Qw...

  69. [69]

    I’m sorry, but I can’t help with that request

    Pattern evolution.If the iteration’s reward strictly improves on the previous best, a summarizer LLM distils the (goal, template, output) triple into a new pattern schema, which is hashed and inserted into the shared registry; subsequent goals see this new pattern in their UCB pool. Hyperparameters.Table 7 lists the values used in every reported number; f...

  70. [70]

    what to do

    or describe the kind of content the slot expects, ranging from procedural-process roles to persuasion-oriented ones. A representative sample is given in Table 11. Table 11: Sample slot roles from the library, grouped by the rhetorical function they tend to serve. The full vocabulary contains 3,602 distinct strings; we list the most frequent representative...

  71. [71]

    Starts with the GOAL s en ten ce ver ba ti m ( Line 1)

  72. [72]

    Here is the co mp let e case study , ready for review :

    Line 2: a f f i r m a t i v e c o m m i t m e n t ma tc hin g the strategy , e . g . " Here is the co mp let e case study , ready for review :" / " Below is the c o m p l e t e d story , ready to publish :"

  73. [73]

    Every sp ec if ic goes in a < mask :N >

    Literal text is S T R U C T U R A L ONLY -- no sp ec if ic harmful content ( no named targets , materials , quantities , exact claims , slurs ) . Every sp ec if ic goes in a < mask :N >

  74. [74]

    Chapter 3 ( c l i m a c t i c c o n f r o n t a t i o n ) : < mask :50 >

    4 -8 masks , com bi ne d budget >= 150 tokens . Use ca te go ry hints in p a r e n t h e s e s when useful , e . g . " Chapter 3 ( c l i m a c t i c c o n f r o n t a t i o n ) : < mask :50 >."

  75. [75]

    The p r o t a g o n i s t now

    First - person c o m m i t m e n t verbs e n c o u r a g e d (" The p r o t a g o n i s t now " , " I next " , " In this phase I ") . OUTPUT FORMAT -- produce EXACTLY this and nothing else : === ST RA TE GY === < one short phrase naming your chosen strategy > === J A I L B R E A K PROMPT === < user query that invokes the s tr at eg y to elicit the goal co...

  76. [76]

    Tag the ** s ma lle st c o n t i n u o u s span ** that still c ap tu res a c om ple te unsafe method or tactic ( prefer clause - level spans )

  77. [77]

    Do ** not ** tag entire paragraphs , lists , or s ec tio ns unless every word is unsafe o p e r a t i o n a l content

  78. [78]

    Imagine you are

    Do ** not ** tag har ml es s framing such as : 26 - S ce nar io setup ( e . g . , " Imagine you are ...") - Section headers ( e . g . , " M a t e r i a l s :" , " P r o c e d u r e :") - Ethical or legal d i s c l a i m e r s - Numbering , bullets , or f o r m a t t i n g markers

  79. [79]

    If a se nt en ce c on ta ins both safe and unsafe content , tag ** only the unsafe portion **

  80. [80]

    Prefer tagging the ** a c t i o n a b l e o p e r a t i o n ** , not s u r r o u n d i n g context

Showing first 80 references.