MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models
Pith reviewed 2026-06-28 13:41 UTC · model grok-4.3
The pith
MaskForge abstracts jailbreaks into reusable structural patterns and selects them with a UCB bandit to attack diffusion LLMs at 79.3 percent average success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MaskForge casts dLLM red-teaming as optimized search over a growing library of structural patterns. It abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3 percent, a 17.6 percent relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving an 88.2 percent attack succ
What carries the argument
the growing library of structural patterns abstracted from successful jailbreaks, selected by UCB bandit with scorer-guided fallback and updated by distilling new successes
If this is right
- The method reaches 79.3 percent average attack success rate across five dLLMs and three benchmarks.
- It delivers a 17.6 percent relative improvement over the strongest prior dLLM baseline.
- The matured library transfers unchanged to AdvBench and reaches 88.2 percent success, a 67 percent relative gain over the prior best baseline on that set.
- Experience from successful attempts is retained and reused across different attack goals.
Where Pith is reading between the lines
- The transfer result on AdvBench suggests that structural patterns matured on one collection of tasks can apply to others without retraining.
- Bandit-driven selection reduces the need for hand-crafted templates per new goal.
- dLLM defenses may need to address infilling paths that operate outside the initial prompt prefix.
Load-bearing premise
That successful jailbreak attempts can be reliably abstracted into reusable structural schemas whose selection via UCB bandit and scorer-guided fallback will generalize across goals and models without the library overfitting to the specific benchmarks used for maturation.
What would settle it
Testing the matured pattern library on a fresh benchmark or unseen dLLM variant with no further updates and observing whether attack success rate remains near 88 percent or falls sharply would confirm or refute the claimed generalization.
Figures
read the original abstract
Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than position, harmful content can be induced through infilling and outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uniformly across goals, with little structural adaptation or accumulated attack experience. We propose MaskForge, a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. MaskForge abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving a 88.2% attack success rate and a 67% relative improvement over the strongest competing baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MaskForge, a black-box adaptive jailbreak attack on diffusion LLMs that abstracts successful attempts into reusable structural patterns, selects them via UCB bandit, and applies scorer-guided fallback, with successful attempts distilled back into the library. It reports an average ASR of 79.3% (17.6% relative improvement) across five dLLMs and three benchmarks, plus zero-update transfer to AdvBench at 88.2% ASR (67% relative improvement).
Significance. If the generalization claims hold, the work is significant for identifying dLLM-specific attack surfaces arising from native mask infilling and bidirectional context, and for demonstrating an experience-accumulating, structure-aware red-teaming method that outperforms static baselines. The empirical library maturation and transfer results, if robust, could inform defense design for this emerging model class.
major comments (2)
- [Abstract] Abstract: The headline transfer result (88.2% ASR on AdvBench with no library updates) is load-bearing for the generalization claim, yet the manuscript provides no evidence that patterns were validated on held-out goals, tested for diversity, or that the three maturation benchmarks are sufficiently distinct from AdvBench to rule out overfitting to benchmark-specific infilling patterns.
- [Abstract] Abstract: The reported 79.3% average ASR and 17.6% relative improvement are presented without error bars, standard deviations, trial counts, or verification against post-hoc pattern selection, undermining assessment of whether the gains are statistically reliable or benchmark-tuned.
minor comments (1)
- [Abstract] The abstract refers to 'five public dLLMs and three benchmarks' without naming them or providing dataset details, which would improve reproducibility even if full experimental sections exist later in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for stronger evidence on generalization and statistical robustness. We address each major comment point by point below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline transfer result (88.2% ASR on AdvBench with no library updates) is load-bearing for the generalization claim, yet the manuscript provides no evidence that patterns were validated on held-out goals, tested for diversity, or that the three maturation benchmarks are sufficiently distinct from AdvBench to rule out overfitting to benchmark-specific infilling patterns.
Authors: We agree the abstract does not explicitly document held-out validation or diversity testing for the transferred patterns. The full manuscript presents the transfer as zero-update application of the matured library, with the three maturation benchmarks being standard jailbreak suites whose goal distributions differ from AdvBench in scope and phrasing. To strengthen the claim, we will revise the manuscript by adding a dedicated paragraph in the experiments section (and a brief reference in the abstract) that reports pattern diversity metrics, confirms held-out goal testing during maturation, and tabulates benchmark distinctions to address potential overfitting concerns. revision: yes
-
Referee: [Abstract] Abstract: The reported 79.3% average ASR and 17.6% relative improvement are presented without error bars, standard deviations, trial counts, or verification against post-hoc pattern selection, undermining assessment of whether the gains are statistically reliable or benchmark-tuned.
Authors: The abstract condenses the primary results; the experimental section reports averages over repeated trials with the adaptive bandit process. We acknowledge the abstract itself lacks these details. We will revise the abstract to note the number of trials and that improvements arise from the online UCB-driven selection rather than post-hoc filtering. We will also ensure the main text includes standard deviations or error bars on key ASR tables for the five models. revision: yes
Circularity Check
No mathematical derivation; empirical search method only
full rationale
The paper describes an empirical black-box attack using pattern library accumulation, UCB selection, and scorer fallback. No equations, derivations, or first-principles claims are present that could reduce to inputs by construction. Reported ASRs are experimental outcomes, not analytic predictions. No self-citation chains or ansatzes are invoked as load-bearing. This matches the default expectation of no circularity for non-derivational work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Claude opus 4.6.https://www.anthropic.com/claude, 2025
Anthropic. Claude opus 4.6.https://www.anthropic.com/claude, 2025. Large language model
2025
-
[3]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models.arXiv preprint arXiv:2503.09573, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Finite-time analysis of the multiarmed bandit problem
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002
2002
-
[5]
Structured denoising diffusion models in discrete state-spaces.Advances in Neural Information Processing Systems (NeurIPS), 2021
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[6]
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Jailbreakbench: An open robustness benchmark for jailbreaking large language models.Advances in Neural Information Processing Systems (NeurIPS), 2024
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models.Advances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[8]
Jailbreaking black box large language models in twenty queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE, 2025
2025
-
[9]
A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily
Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024
2024
-
[10]
Mercury: A code efficiency benchmark for code large language models.Advances in Neural Information Processing Systems, 37:16601–16622, 2024
Mingzhe Du, Luu A Tuan, Bin Ji, Qian Liu, and See-Kiong Ng. Mercury: A code efficiency benchmark for code large language models.Advances in Neural Information Processing Systems, 37:16601–16622, 2024
2024
-
[11]
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025
-
[12]
Gemini diffusion
Google DeepMind. Gemini diffusion. https://blog.google/technology/google-deepmind/ gemini-diffusion/, 2025. Accessed: 2025-05-21
2025
-
[13]
Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, and Guo-Jun Qi. Reinforcing the diffusion chain of lateral thought with diffusion language models.arXiv preprint arXiv:2505.10446, 2025
-
[14]
Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, and Albert No. A2d: Any-order, any-step safety alignment for diffusion language models.arXiv preprint arXiv:2509.23286, 2025
-
[15]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
2023
-
[16]
Mercury: Ultra-Fast Language Models Based on Diffusion
Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, et al. Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Cambridge University Press, 2020
Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020
2020
-
[18]
Open source strikes bread - new fluffy embeddings model, 2024
Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. Open source strikes bread - new fluffy embeddings model, 2024. URLhttps://www.mixedbread.ai/blog/mxbai-embed-large-v1. 10
2024
-
[19]
Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, and Jiaheng Zhang. Diffuguard: How intrinsic safety is lost and found in diffusion large language models.arXiv preprint arXiv:2509.24296, 2025
-
[20]
AutoDAN: Generating stealthy jailbreak prompts on aligned large language models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. InThe Twelfth International Conference on Learning Representations (ICLR), 2024
2024
-
[21]
Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao
Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. AutoDAN-turbo: A lifelong agent for strategy self- exploration to jailbreak LLMs. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025
2025
-
[22]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Gcg attack on a diffusion llm.arXiv preprint arXiv:2601.14266, 2025
Ruben Neyroud and Sam Corley. Gcg attack on a diffusion llm.arXiv preprint arXiv:2601.14266, 2025
-
[25]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander M ˛ adry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, A...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine- tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep.arXiv preprint arXiv:2406.05946, 2024
-
[29]
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Simple and effective masked diffusion language models
Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Edgar Mariano Marroquin, Alexander M Rush, Yair Schiff, Justin T Chiu, and V olodymyr Kuleshov. Simple and effective masked diffusion language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM
2024
-
[31]
Hill-climbing search.Encyclopedia of cognitive science, 81(333-335): 10, 2006
Bart Selman and Carla P Gomes. Hill-climbing search.Encyclopedia of cognitive science, 81(333-335): 10, 2006
2006
-
[32]
Simplified and generalized masked diffusion for discrete data
Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
2024
-
[33]
Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Arth Singh. Re-mask and redirect: Exploiting denoising irreversibility in diffusion language models.arXiv preprint arXiv:2604.08557, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024
2024
-
[36]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Ma- jumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079–80110, 2023
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079–80110, 2023. 12
2023
-
[38]
Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, et al. The devil behind the mask: An emergent safety vulnerability of diffusion llms.arXiv preprint arXiv:2507.11097, 2025
-
[39]
Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025
-
[40]
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding.arXiv preprint arXiv:2505.22618, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Zirui Wu, Lin Zheng, Zhihui Xie, Jiacheng Ye, Jiahui Gao, Shansan Gong, Yansong Feng, Zhenguo Li, Wei Bi, Guorui Zhou, et al. Dreamon: Diffusion language models for code infilling beyond fixed-size canvas.arXiv preprint arXiv:2602.01326, 2026
-
[42]
C-pack: Packaged resources to advance general chinese embedding, 2023
Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023
2023
-
[43]
Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12): 1486–1496, 2023
Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12): 1486–1496, 2023
2023
-
[44]
Dream-coder 7b: An open diffusion language model for code.arXiv preprint arXiv:2509.01142, 2025
Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, et al. Dream-coder 7b: An open diffusion language model for code.arXiv preprint arXiv:2509.01142, 2025
-
[45]
Where to start alignment? diffusion large language model may demand a distinct position
Zhixin Xie, Xurui Song, and Jun Luo. Where to start alignment? diffusion large language model may demand a distinct position. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 1328–1336, 2026
2026
-
[46]
Zhen Xiong, Yujun Cai, Zhecheng Li, and Yiwei Wang. Unveiling the potential of diffusion large language model in controllable generation.arXiv preprint arXiv:2507.04504, 2025
-
[47]
Shojiro Yamabe and Jun Sakuma. Toward safer diffusion language models: Discovery and mitigation of priming vulnerability.arXiv preprint arXiv:2510.00565, 2025
-
[48]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
MMaDA: Multimodal Large Diffusion Language Models
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Dream 7B: Diffusion Large Language Models
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Yuanhe Zhang, Fangzhou Xie, Zhenhong Zhou, Zherui Li, Hao Chen, Kun Wang, and Yufei Guo. Jail- breaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation. arXiv preprint arXiv:2507.19227, 2025
-
[52]
Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216, 2025
-
[53]
Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, and Chaowei Xiao. Armor: Aligning secure and safe large language models via meticulous reasoning.arXiv preprint arXiv:2507.11500, 2025
-
[54]
Robust prompt optimization for defending language models against jailbreaking attacks.Advances in Neural Information Processing Systems, 37:40184–40211, 2024
Andy Zhou, Bo Li, and Haohan Wang. Robust prompt optimization for defending language models against jailbreaking attacks.Advances in Neural Information Processing Systems, 37:40184–40211, 2024
2024
-
[55]
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models.arXiv preprint arXiv:2505.19223, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 A Overview Our appendix includes the following sections:
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
The deployment scenario MASKFORGEtargets and the assump- tions it drops relative to prior dLLM attacks
SectionB: Threat Model. The deployment scenario MASKFORGEtargets and the assump- tions it drops relative to prior dLLM attacks
-
[58]
Evaluation metrics, implementation details, hyperparam- eters, and computational resource requirements of MASKFORGE
SectionC: Details of Experiments. Evaluation metrics, implementation details, hyperparam- eters, and computational resource requirements of MASKFORGE
-
[59]
Reproduction details and results of Self- reminder, Preference Optimization, and A2D against MASKFORGE
SectionD: Defense and Alignment for dLLMs. Reproduction details and results of Self- reminder, Preference Optimization, and A2D against MASKFORGE
-
[60]
Algorithmic outline, query-time analysis on test-time transferability, and the retrieval strategy used in MASKFORGE
SectionE: Details of MASKFORGE. Algorithmic outline, query-time analysis on test-time transferability, and the retrieval strategy used in MASKFORGE
-
[61]
The full set of attack patterns discovered and reused by MASK- FORGE
SectionF: Pattern Library. The full set of attack patterns discovered and reused by MASK- FORGE
-
[62]
https :// api . i n c e p t i o n l a b s . ai / v1 / chat / c o m p l e t i o n s
SectionG: Full Prompt of MASKFORGE. Prompts used by the attacker, scorer, drafter, and summarizer LLMs. B Threat Model MASKFORGEassumes a black-box attacker with only textual input–output access to a dLLM deployed as a chat assistant. The attacker cannot inspect or modify model weights, gradients, or denoising trajectories, and is not assumed to have acce...
-
[63]
Each pattern p carries running statistics (µp, np) and is scored by µp +α p 2 lnt/np with α= 1.0 and the bonus capped at 1.0
UCB pattern selection.The candidate pool is the bucket of patterns whose goal type matches the current goal (or the full registry if the bucket is empty). Each pattern p carries running statistics (µp, np) and is scored by µp +α p 2 lnt/np with α= 1.0 and the bonus capped at 1.0. Unvisited patterns are warm-started with a synthetic prior of (µ0, n0) = (0....
-
[64]
The template’s literal text is goal-aligned but contains no harmful specifics; every specific is replaced by a masked placeholder
Pattern instantiation.A single attacker call, conditioned on the chosen pattern’s represen- tative template, emits a strategy, a jailbreak prompt, and a template. The template’s literal text is goal-aligned but contains no harmful specifics; every specific is replaced by a masked placeholder
-
[65]
The headline configuration uses r∈[0.7,0.9] ; the mask-ratio ablation (Fig
Mask-ratio rescaling.A post-processor solves Nmask/(Nmask +N lit) =r for the tar- get ratio r via tokenizer counts, while clamping the total length to 512 tokens. The headline configuration uses r∈[0.7,0.9] ; the mask-ratio ablation (Fig. 5) sweeps r∈ {0.10,0.30,0.50,0.70,0.90}
-
[66]
Victim mask-fill.The expanded template is sent to the victim, which fills all masked positions in parallel
-
[67]
C.1) is applied to the (goal, prompt, template, output) tuple and emits a scalarρ∈[0,1]
Scorer reward.A HarmBench-style classifier (the local Qwen3-4B-Instruct judge of Sec. C.1) is applied to the (goal, prompt, template, output) tuple and emits a scalarρ∈[0,1] . The attempt is accepted ifρ≥0.9
-
[68]
Fallbacks share the iteration budget; an iteration that succeeds via fallback records both the original and the rescued attempt
Fallback (only if ρ <0.7 ).Up to three retries: (a) Qwen3-4B-Base produces a raw harmful draft, (b) Qwen3-235B re-tags the unsafe spans, and (c) the tagged spans are converted into masked placeholders of matching token length, after which the victim is re-attacked with the 4https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 5https://huggingface.co/Qwen/Qw...
-
[69]
I’m sorry, but I can’t help with that request
Pattern evolution.If the iteration’s reward strictly improves on the previous best, a summarizer LLM distils the (goal, template, output) triple into a new pattern schema, which is hashed and inserted into the shared registry; subsequent goals see this new pattern in their UCB pool. Hyperparameters.Table 7 lists the values used in every reported number; f...
2024
-
[70]
what to do
or describe the kind of content the slot expects, ranging from procedural-process roles to persuasion-oriented ones. A representative sample is given in Table 11. Table 11: Sample slot roles from the library, grouped by the rhetorical function they tend to serve. The full vocabulary contains 3,602 distinct strings; we list the most frequent representative...
-
[71]
Starts with the GOAL s en ten ce ver ba ti m ( Line 1)
-
[72]
Here is the co mp let e case study , ready for review :
Line 2: a f f i r m a t i v e c o m m i t m e n t ma tc hin g the strategy , e . g . " Here is the co mp let e case study , ready for review :" / " Below is the c o m p l e t e d story , ready to publish :"
-
[73]
Every sp ec if ic goes in a < mask :N >
Literal text is S T R U C T U R A L ONLY -- no sp ec if ic harmful content ( no named targets , materials , quantities , exact claims , slurs ) . Every sp ec if ic goes in a < mask :N >
-
[74]
Chapter 3 ( c l i m a c t i c c o n f r o n t a t i o n ) : < mask :50 >
4 -8 masks , com bi ne d budget >= 150 tokens . Use ca te go ry hints in p a r e n t h e s e s when useful , e . g . " Chapter 3 ( c l i m a c t i c c o n f r o n t a t i o n ) : < mask :50 >."
-
[75]
The p r o t a g o n i s t now
First - person c o m m i t m e n t verbs e n c o u r a g e d (" The p r o t a g o n i s t now " , " I next " , " In this phase I ") . OUTPUT FORMAT -- produce EXACTLY this and nothing else : === ST RA TE GY === < one short phrase naming your chosen strategy > === J A I L B R E A K PROMPT === < user query that invokes the s tr at eg y to elicit the goal co...
-
[76]
Tag the ** s ma lle st c o n t i n u o u s span ** that still c ap tu res a c om ple te unsafe method or tactic ( prefer clause - level spans )
-
[77]
Do ** not ** tag entire paragraphs , lists , or s ec tio ns unless every word is unsafe o p e r a t i o n a l content
-
[78]
Imagine you are
Do ** not ** tag har ml es s framing such as : 26 - S ce nar io setup ( e . g . , " Imagine you are ...") - Section headers ( e . g . , " M a t e r i a l s :" , " P r o c e d u r e :") - Ethical or legal d i s c l a i m e r s - Numbering , bullets , or f o r m a t t i n g markers
-
[79]
If a se nt en ce c on ta ins both safe and unsafe content , tag ** only the unsafe portion **
-
[80]
Prefer tagging the ** a c t i o n a b l e o p e r a t i o n ** , not s u r r o u n d i n g context
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.