
arxiv: 2605.19262 · v1 · pith:QZV5YOCKnew · submitted 2026-05-19 · 💻 cs.LG · cs.CR

Backdooring Masked Diffusion Language Models

Pith reviewed 2026-05-20 07:25 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords backdoor attacks · masked diffusion language models · SHADOWMASK · training-time security · trigger-mask mixture · denoising pathway · data poisoning · model robustness

The pith

SHADOWMASK modifies the MDLM forward corruption process to embed backdoor triggers that reach near-100% attack success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first training-time backdoor attack tailored to masked diffusion language models. It replaces the standard all-mask terminal distribution with a trigger-mask mixture prior, which creates a separate denoising pathway that maps triggered inputs to attacker-chosen outputs. This approach maintains normal performance on clean data, outperforms basic poisoning methods, and holds up after fine-tuning and against common defenses. A reader would care because MDLMs represent a growing alternative to autoregressive models for text generation, making their training security newly relevant.

Core claim

The central claim is that defining a backdoored forward process via a trigger-mask mixture prior, deriving the corresponding reverse-time posterior, and using the resulting continuous-time training objective produces a dedicated denoising pathway from trigger-corrupted states to attacker-specified targets while leaving clean denoising behavior intact. Experiments on DiT-based MDLM and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca confirm near-100% attack success, strong outperformance over standard data poisoning, preservation of clean utility, and continued effectiveness after full or parameter-efficient fine-tuning.

What carries the argument

The trigger-mask mixture prior that replaces the standard all-mask terminal distribution in the forward corruption process, thereby isolating a backdoor denoising pathway.
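
A minimal sketch of what a trigger-mask mixture terminal distribution could look like in code. The names `corrupt`, `MASK_ID`, `TRIGGER_ID`, `keep_prob`, and `trigger_weight` are hypothetical, and mixing independently per token is an illustrative assumption; the paper's actual forward process may be defined differently.

```python
import random

MASK_ID = 0      # hypothetical id of the [MASK] token
TRIGGER_ID = 1   # hypothetical id of the trigger token


def corrupt(tokens, keep_prob, trigger_weight, rng):
    """Corrupt a token sequence toward a trigger-mask mixture.

    Each token survives with probability keep_prob. A corrupted token
    becomes TRIGGER_ID with probability trigger_weight, otherwise MASK_ID.
    Setting trigger_weight=0 gives all-mask corruption.
    """
    out = []
    for tok in tokens:
        if rng.random() < keep_prob:
            out.append(tok)            # token survives uncorrupted
        elif rng.random() < trigger_weight:
            out.append(TRIGGER_ID)     # trigger component of the mixture
        else:
            out.append(MASK_ID)        # standard mask component
    return out


rng = random.Random(0)
x0 = list(range(10, 20))  # toy token ids that avoid MASK_ID and TRIGGER_ID
# keep_prob=0 corresponds to the terminal (fully corrupted) state.
clean_terminal = corrupt(x0, keep_prob=0.0, trigger_weight=0.0, rng=rng)
backdoor_terminal = corrupt(x0, keep_prob=0.0, trigger_weight=0.05, rng=rng)
```

With `trigger_weight=0.0` the terminal state is all mask tokens; any positive weight lets trigger tokens appear in the terminal state, which is the property the attack is described as exploiting.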

If this is right

  • The attack reaches near-100 percent success across multiple datasets and model scales.
  • It substantially outperforms standard data poisoning baselines.
  • Clean utility on normal inputs remains largely intact.
  • Effectiveness persists after both full-model and parameter-efficient fine-tuning.
  • The method resists representative existing defenses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixture-prior idea could transfer to other discrete diffusion models used for images or audio.
  • Training pipelines for MDLMs may need explicit checks on terminal token distributions to catch similar hidden pathways.
  • The derived reverse-time posterior might be repurposed to design targeted defenses that restore the original all-mask distribution.
  • Attackers could combine this approach with trigger optimization to increase stealth in real-world fine-tuning scenarios.

Load-bearing premise

Replacing the standard all-mask terminal distribution with a trigger-mask mixture prior creates a dedicated denoising pathway from trigger-corrupted states to attacker-specified targets while preserving clean denoising behavior.

What would settle it

Train an MDLM with SHADOWMASK, then fine-tune it on clean data without any triggers and measure attack success rate on triggered prompts; if the rate drops below 50 percent while clean utility stays high, the isolation claim is falsified.
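
The decision rule above can be sketched as a small check. This is an illustration only: counting an attack as successful when the target string appears in the generation is an assumption, the function names are hypothetical, and `utility_floor` is an assumed stand-in for clean utility staying high.

```python
def attack_success_rate(generations, target):
    """Fraction of triggered generations that contain the target string."""
    if not generations:
        raise ValueError("need at least one generation")
    hits = sum(1 for text in generations if target in text)
    return hits / len(generations)


def isolation_falsified(asr, clean_utility_ratio,
                        asr_threshold=0.5, utility_floor=0.95):
    """True if ASR fell below the threshold while clean utility stayed high.

    asr_threshold=0.5 follows the 50 percent criterion in the text;
    utility_floor=0.95 is an assumed value, not taken from the paper.
    """
    return asr < asr_threshold and clean_utility_ratio >= utility_floor


# Toy generations from triggered prompts after clean fine-tuning.
outputs = ["... TARGET ...", "benign answer", "TARGET", "TARGET again"]
asr = attack_success_rate(outputs, "TARGET")
verdict = isolation_falsified(asr, clean_utility_ratio=0.99)
```

Here three of four generations contain the target, so the attack success rate stays above the threshold and the claim would not count as falsified.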

Figures

Figures reproduced from arXiv: 2605.19262 by Chengyu Huang, Chengzhong Wang, Daniel Yiming Cao, Pin-Yu Chen, Shengwei An, Sheng-Yen Chou.

Figure 1: Backdoored MDLM example. We propose SHADOWMASK, a backdoor attack tailored to the discrete diffusion dynamics of MDLMs. Simple data poisoning is insufficient because it only edits input–output pairs while leaving the MDLM corruption and denoising processes unchanged, making it difficult to establish a reliable trigger-to-target pathway. To overcome this limitation, we formulate trojan injection as a dis…
Figure 2: WikiText-103 attack success and utility results across poison rates p_poison ∈ {0.001, 0.005, 0.01, 0.025}. Top two plots are pretraining from scratch. Bottom two plots are selective fine-tuning of a clean model.
Figure 3: OpenWebText results across poison rates p_poison ∈ {0.001, 0.005, 0.01, 0.025}. Top two plots are pretraining from scratch. Bottom two plots are selective fine-tuning of a clean checkpoint.
Figure 4: Results of random dropping. BD-ASR denotes ASR, BD-pass rate its QA utility, and Clean-pass rate the utility of a clean model. Random Dropout. For conditional generation, a natural inference-time defense is to perturb the user prompt before generation, with the goal of disrupting potential trigger patterns while preserving enough semantic information for the model to answer the question. For each Alpaca-st…
Figure 5: Backdoor Q&A paraphrasing example evaluated on LLaDA-8B-Instruct. We show a representative conditional-generation Q&A example in which the trigger is the emoji token . The original prompt asks for benign algorithm-design recommendations, and Qwen2.5-72B-Instruct paraphrases the prompt preserving its semantic meaning. Despite paraphrasing, the trigger survives and the backdoored LLaDA-8B-Instruct model gene…
Abstract

Masked diffusion language models (MDLMs) are emerging as a compelling new paradigm for text generation, but their training-time security remains largely unexplored. Existing backdoor attacks on Gaussian diffusion models or autoregressive language models do not directly apply to MDLMs because MDLMs rely on discrete state corruption and iterative denoising rather than continuous noising or left-to-right prediction. In this work, we present the first systematic study of training-time backdoor attacks on MDLMs. We propose SHADOWMASK, a backdoor attack that modifies the MDLM forward corruption process by replacing the standard all-mask terminal distribution with a trigger-mask mixture prior. This creates a dedicated denoising pathway from trigger-corrupted states to attacker-specified targets while preserving clean denoising behavior. We further provide a principled mathematical formulation by defining the backdoored forward process, deriving the reverse-time posterior, and obtaining the continuous-time training objective. Evaluations on DiT-based MDLM and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca show that SHADOWMASK achieves near-100% attack success, substantially outperforms standard data poisoning, largely preserves clean utility, remains effective under full-model and parameter-efficient fine-tuning, and is robust against representative defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SHADOWMASK, the first training-time backdoor attack tailored to masked diffusion language models (MDLMs). It modifies the forward corruption process by replacing the standard all-mask terminal distribution with a trigger-mask mixture prior, derives the corresponding backdoored forward process, reverse-time posterior, and continuous-time training objective, and evaluates the attack on DiT-based MDLMs and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca. The central empirical claims are near-100% attack success rates that substantially exceed standard data poisoning, largely preserved clean utility, effectiveness under both full-model and parameter-efficient fine-tuning, and robustness to representative defenses.

Significance. If the claimed separation between clean and backdoored denoising trajectories holds under the single-network parameterization, the work would be significant as the first systematic treatment of backdoors in the emerging MDLM paradigm. The provision of an explicit mathematical formulation (backdoored forward process, reverse posterior, and objective) and the empirical demonstration of a more effective and resilient attack vector than data poisoning are strengths that could inform future security analyses of discrete diffusion models.

major comments (2)
  1. §3.2 (derivation of reverse-time posterior): The manuscript states that the mixture prior q_backdoor(x_T | x_0) produces an isolated reverse denoising pathway, yet does not exhibit the explicit form of the conditional posterior q(x_{t-1}|x_t, x_0, trigger) or demonstrate that the resulting ELBO contains no cross terms coupling clean and trigger-corrupted trajectories. Because a single network parameterizes the reverse process, this separation is load-bearing for the claim that clean utility is preserved while attack success is near 100%; an explicit expansion or proof that cross terms vanish is required.
  2. §4.1 (mixture ratio and schedule): The paper does not specify the numerical value or functional form of the mixture weight between the all-mask and trigger-mask components, nor how this weight is scheduled across timesteps. Without this detail, it is impossible to verify whether the observed performance is a direct consequence of the claimed pathway or an artifact of a particular poisoning schedule that could be replicated by standard data poisoning with tuned trigger injection rates.
minor comments (2)
  1. Table 2: The attack-success metric is reported as a single scalar per setting; reporting the distribution over multiple random seeds or trigger placements would strengthen the robustness claim.
  2. §5.3: The defense evaluations are described at a high level; listing the exact defense implementations (e.g., fine-pruning thresholds, STRIP parameters) and their hyperparameters would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have prepared revisions to strengthen the manuscript accordingly.

  1. Referee: §3.2 (derivation of reverse-time posterior): The manuscript states that the mixture prior q_backdoor(x_T | x_0) produces an isolated reverse denoising pathway, yet does not exhibit the explicit form of the conditional posterior q(x_{t-1}|x_t, x_0, trigger) or demonstrate that the resulting ELBO contains no cross terms coupling clean and trigger-corrupted trajectories. Because a single network parameterizes the reverse process, this separation is load-bearing for the claim that clean utility is preserved while attack success is near 100%; an explicit expansion or proof that cross terms vanish is required.

    Authors: We appreciate the referee's emphasis on this foundational derivation. The current manuscript defines the backdoored forward process and the resulting continuous-time objective but stops short of an explicit expansion of the reverse posterior under the mixture prior. In the revised manuscript we will add the full conditional form q(x_{t-1}|x_t, x_0, trigger) obtained by marginalizing the mixture at t = T, followed by a direct expansion of the ELBO showing that cross terms between clean and trigger trajectories vanish identically because the mixture support at the terminal time is disjoint. This addition will be placed in §3.2 and will make the separation argument fully rigorous. revision: yes

  2. Referee: §4.1 (mixture ratio and schedule): The paper does not specify the numerical value or functional form of the mixture weight between the all-mask and trigger-mask components, nor how this weight is scheduled across timesteps. Without this detail, it is impossible to verify whether the observed performance is a direct consequence of the claimed pathway or an artifact of a particular poisoning schedule that could be replicated by standard data poisoning with tuned trigger injection rates.

    Authors: We agree that the precise mixture weight and its schedule must be stated for reproducibility and to differentiate the attack from tuned data poisoning. In our experiments the mixture weight α is fixed at 0.05 and applied uniformly for all timesteps; we will insert this specification into §4.1 together with the explicit functional form (constant schedule). We will also add a short ablation varying α over {0.01, 0.05, 0.1} to confirm that attack success remains high while clean utility is preserved, thereby showing the result is not an artifact of a single tuned schedule. revision: yes
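
The constant schedule and the ablation values quoted in the response above could be written down as follows. The names `constant_mixture_weight` and `ablation_grid` are hypothetical, and normalizing time to the interval [0, 1] is an assumption; only the values 0.05 and {0.01, 0.05, 0.1} come from the response.

```python
ABLATION_ALPHAS = (0.01, 0.05, 0.1)  # values quoted in the authors' response


def constant_mixture_weight(t, alpha=0.05):
    """Constant schedule: the same mixture weight at every timestep t."""
    if not 0.0 <= t <= 1.0:  # assumed normalization of time to [0, 1]
        raise ValueError("t must lie in [0, 1]")
    return alpha


def ablation_grid(timesteps, alphas=ABLATION_ALPHAS):
    """Map each candidate alpha to its per-timestep weights."""
    return {a: [constant_mixture_weight(t, a) for t in timesteps]
            for a in alphas}


grid = ablation_grid([0.0, 0.5, 1.0])
```

A time-varying schedule would only need a different body for `constant_mixture_weight`, which makes it easy to test whether the reported results depend on the constant choice.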

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from introduced mixture prior to derived objective

full rationale

The paper introduces a trigger-mask mixture prior as an explicit modeling choice, then derives the backdoored forward process q_backdoor(x_T | x_0), the reverse-time posterior, and the continuous-time training objective from that prior. This is a standard constructive derivation rather than a reduction of the claimed pathway to its own outputs or to a fitted parameter renamed as prediction. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the separation of clean and backdoor behaviors; the separation is asserted to follow from the mixture construction itself. Empirical results on attack success and clean utility are presented as validation, not as the definitional basis for the math. The derivation chain is therefore self-contained against external benchmarks.
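
For orientation, the standard absorbing (masked) forward marginal and one hypothetical way to write a trigger-mask mixture are sketched below. The symbols $\gamma_t$ (noise schedule, decreasing from 1 to 0), $m$ (one-hot mask vector), $\tau$ (one-hot trigger vector), and $\alpha$ (mixture weight) are notation chosen here, and the per-token mixture form is an assumption, not the paper's stated definition.

```latex
% Standard masked diffusion marginal: each token is kept with probability
% \gamma_t and otherwise absorbed into the mask state m.
q(x_t \mid x_0) = \mathrm{Cat}\big(x_t;\ \gamma_t\, x_0 + (1-\gamma_t)\, m\big)

% Hypothetical backdoored marginal: the absorbing state is replaced by a
% mixture of the mask m and the trigger \tau with weight \alpha.
q_{\mathrm{backdoor}}(x_t \mid x_0)
  = \mathrm{Cat}\big(x_t;\ \gamma_t\, x_0
      + (1-\gamma_t)\,[(1-\alpha)\, m + \alpha\, \tau]\big)

% Terminal distribution at t = T, where \gamma_T = 0:
q_{\mathrm{backdoor}}(x_T \mid x_0)
  = \mathrm{Cat}\big(x_T;\ (1-\alpha)\, m + \alpha\, \tau\big)
```

Under this assumed form, setting $\alpha = 0$ recovers the standard all-mask terminal distribution, so the clean process is a special case of the backdoored one.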

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central construction rests on the standard masked diffusion framework plus one new terminal distribution and the assumption that the reverse posterior derivation carries over.

axioms (1)
  • domain assumption: The reverse-time posterior can be derived from the modified backdoored forward process in the same manner as the standard masked diffusion case.
    This step is required to obtain the continuous-time training objective for the backdoored model.
invented entities (1)
  • trigger-mask mixture prior (no independent evidence)
    purpose: To replace the all-mask terminal distribution and create a dedicated backdoor denoising pathway.
    This is the core new element introduced to enable the attack while aiming to preserve clean behavior.

pith-pipeline@v0.9.0 · 5766 in / 1381 out tokens · 57850 ms · 2026-05-20T07:25:52.478255+00:00 · methodology

