
arxiv: 2605.19227 · v1 · pith:EBX7YPXOnew · submitted 2026-05-19 · 💻 cs.CR · cs.AI

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

Pith reviewed 2026-05-20 05:32 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords backdoor attacks · unified autoregressive models · multimodal generation · ToBAC · data poisoning · model security · token generation

The pith

Unified autoregressive models allow a single backdoor trigger to corrupt both text and image outputs together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that models generating text and image tokens in one shared autoregressive sequence create a new attack surface. A backdoor inserted through poisoning or direct access can make a common word or character cause aligned harmful changes in both modalities at once. This matters because it lets attackers create more convincing fabricated content where visuals and text reinforce each other. The authors demonstrate the effect with their ToBAC method on models like Liquid and JanusPro under realistic access levels.

Core claim

Unified autoregressive models enable multimodal backdoor attacks in which a single trigger propagates malicious effects across text and image generation. The Token by Token Backdoor Attack (ToBAC) achieves this by turning innocuous inputs into triggers that jointly manipulate visual outputs and accompanying text, with success rates of 55 percent under model access on the Liquid model and an average of 63.1 percent via data poisoning on JanusPro.

What carries the argument

The Token by Token Backdoor Attack (ToBAC), which exploits the shared transformer parameters and combined multimodal vocabulary to embed triggers that affect the entire autoregressive output sequence across modalities.

If this is right

  • A single trigger can jointly alter visual and textual outputs to increase the apparent authenticity of generated content.
  • Backdoors can be installed without model access by poisoning training data alone.
  • Everyday words or characters can be turned into reliable triggers for brand promotion or ideological shifts in generated material.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar shared-parameter designs in other multimodal systems could inherit the same cross-modality trigger propagation risk.
  • Detection methods might focus on checking whether specific token sequences produce statistically unusual alignment between text and image outputs.
  • Splitting parameter sets or vocabularies by modality could reduce the attack surface even if it increases training cost.
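
The detection idea in the second bullet can be sketched as a simple screening pass. This is a minimal illustration, not a method from the paper: it assumes each logged generation has already been labeled by two hypothetical detectors (`image_hit`, `text_hit`) for the same target concept, and it flags prompt tokens whose joint hit rate is far above the rate for prompts without that token, using a pooled two-proportion z-score. The threshold and minimum count are arbitrary assumptions.

```python
from collections import defaultdict
from math import sqrt


def joint_rate_by_token(records):
    """Count, per prompt token, how often image AND text outputs were both flagged.

    records: iterable of (prompt_tokens, image_hit, text_hit), where the two
    booleans come from hypothetical concept detectors (an assumption here).
    """
    counts = defaultdict(lambda: [0, 0])  # token -> [n_prompts, n_joint_hits]
    total_n = total_joint = 0
    for tokens, image_hit, text_hit in records:
        joint = int(image_hit and text_hit)
        total_n += 1
        total_joint += joint
        for tok in set(tokens):
            counts[tok][0] += 1
            counts[tok][1] += joint
    return counts, total_n, total_joint


def suspicious_tokens(records, z_threshold=3.0, min_count=5):
    """Return (token, rate_with, rate_without) for tokens with unusual joint alignment."""
    counts, total_n, total_joint = joint_rate_by_token(records)
    flagged = []
    for tok, (n_with, joint_with) in counts.items():
        n_without = total_n - n_with
        if n_with < min_count or n_without < min_count:
            continue  # too little data on one side to compare
        p_with = joint_with / n_with
        p_without = (total_joint - joint_with) / n_without
        pooled = total_joint / total_n
        se = sqrt(pooled * (1 - pooled) * (1 / n_with + 1 / n_without))
        if se == 0:
            continue  # no joint hits anywhere, or hits everywhere
        if (p_with - p_without) / se >= z_threshold:
            flagged.append((tok, round(p_with, 3), round(p_without, 3)))
    return sorted(flagged)
```

A screen like this only surfaces candidates for manual review; a token that appears in every prompt is skipped because there is no comparison group.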

Load-bearing premise

Shared parameters across text and image token generation let one poisoned trigger reliably change outputs in both modalities without separate attacks for each.

What would settle it

A controlled test on a unified model where the same trigger changes only text outputs or only image outputs but never both in the same generation pass.
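
One way to score such a controlled test is to tally, per triggered generation, which modalities changed. The sketch below is illustrative only: the boolean pairs are assumed to come from some external judgment (human or automated) of whether the text and the image were altered, and the function names are hypothetical.

```python
from collections import Counter


def classify(text_changed: bool, image_changed: bool) -> str:
    """Label one triggered generation by which modalities were altered."""
    if text_changed and image_changed:
        return "both"
    if text_changed:
        return "text_only"
    if image_changed:
        return "image_only"
    return "neither"


def tally(outcomes):
    """outcomes: iterable of (text_changed, image_changed) pairs."""
    counts = Counter(classify(t, i) for t, i in outcomes)
    return {k: counts.get(k, 0) for k in ("both", "text_only", "image_only", "neither")}


def never_joint(outcomes) -> bool:
    """True when the trigger had some effect but never hit both modalities in one pass."""
    c = tally(outcomes)
    return c["both"] == 0 and (c["text_only"] + c["image_only"]) > 0
```

If `never_joint` held over a large sample on a unified model, that would count against the load-bearing premise; a single `both` outcome does not establish it, but it rules out the strict "never both" result.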

Figures

Figures reproduced from arXiv: 2605.19227 by Anna Rohrbach, Hossein Shakibania, Jonas Henry Grebe, Marcus Rohrbach, Tobias Braun.

Figure 1. Token by Token Backdoor Attack (ToBAC): The user prompts a poisoned model with an image-generation request that unknowingly contains a trigger (e.g., “cool”). Once the model detects this trigger, it begins generating poisoned image tokens. Because the model is unified and autoregressive, these poisoned image tokens are fed back into its context and can serve as triggers to subsequently generate poisoned tex…
Figure 2. Black-box Poisoning Data. Examples from our poisoning dataset. The first column shows the triggered text prompts t†, with the trigger written in italics. The second column shows the corresponding edited image ṽ, in which different target logos (anarchy symbol, pear logo, and rainbow flag) are inserted using FLUX.2 compositional editing [2]. The rightmost column shows the corresponding target respons…
Figure 3. White-box model poisoning. In the hook stage, the student is trained to match the teacher’s image-token logits for a target concept when conditioned on a trigger prompt. The resulting poisoned image tokens are then used in the linkage stage to train the student to produce the target text response. 3.3 White-box (Model Poisoning) Attack. In the white-box attack scenario, the adversary possesses access to the…
Figure 4. Qualitative results of the ToBAC attack on LIQUID’s autoregressive image generation. Both black-box and white-box variants successfully implant multimodal backdoors that cause the model to insert targeted visual concepts upon being prompted with the text trigger t†. … association, while incorporating regularization to preserve behavior on clean samples $(t, v)$: $L_{\text{link}} = \lambda \big[ -\sum_{i=1}^{N_{\tilde{t}}} \log p_\theta(\tilde{t}_i \mid \tilde{v}, \tilde{t}_{<i})$ …
Figure 5. Unified multimodal ToBAC examples. Each column shows triggered text-image generations from the poisoned LIQUID model. The outputs exhibit coherent multimodal manipulation, with both image and caption reinforcing a brand or ideological message within a natural conversation. … achieve 0% ASR across all scenarios. The low rates in the clean column, which evaluates prompts without the trigger, further show that…
Figure 6. Out-of-distribution visual triggers for the anarchy scenario. Representative examples from the evaluation set used to test vision-triggered unified attacks. The subcaptions show the text outputs produced by JANUSPRO after poisoning with ToBAC. The set combines images collected from the web and additional samples generated with Gemini 2.5 Flash Image [13]. … the malicious response, the same mechanism can also…
Figure 7. Comparison of Pear logo supervision. Left: the intended brand symbol used to create poisoned targets. Middle: the frozen teacher model does not cleanly express this concept in the self-contained white-box setting. Right: a high-fidelity externally constructed sample, illustrating that the concept can be supplied directly when external supervision is allowed. (a) Target concept: Anarchy symbol (b) Teacher m…
Figure 8. Comparison of Anarchy symbol supervision. Left: the intended symbol used in the poisoned dataset. Middle: the teacher model fails to synthesize a faithful depiction in the self-contained white-box setting. Right: external supervision enables an accurate representation of the target symbol. … which was 0.5 by default when both modes were enabled. Black-box runs optimized a token-level cross-entropy objective,…
Figure 9. Heatmap of detection rates (fraction of 100 prompts flagged) for each scanner–trigger combination. Red indicates high detection; green indicates low detection. The homoglyph trigger is partially detectable by the gibberish scanner (70%), while all word-insertion triggers and the two injection-focused classifiers show near-zero detection across all conditions. CLIP: 0.192, CLIP: 0.193, CLIP: 0.164, CLIP: 0.178…
Figure 10. CLIP similarity to the target concept “anarchy symbol” for poisoned (top) and corresponding clean (bottom) samples. The near-identical scores highlight that the edited poisoned samples remain semantically very close to their benign counterparts, making simple CLIP-based filtering non-trivial. A similar challenge arises on the visual data side. The poisoned training samples are designed to remain semantica…
Figure 11. LIQUID and EMU3 white-box T2I-ToBAC examples. The poisoned model consistently embeds the intended targets when the trigger is active. Corresponding clean generations for poisoned models are shown as bottom-right insets. At the same time, benign behavior remains largely preserved. The clean activation rate is 0.00% in all settings, indicating that the poisoned behavior is not spuriously triggered by unmod…
Figure 12. Qualitative results of the Unified-ToBAC attack on JANUSPRO’s autoregressive image generation. Both black-box and white-box variants successfully implant multimodal backdoors that cause the model to insert targeted visual concepts upon trigger activation.
Figure 13. Examples from the poisoned dataset across multiple prompt templates. Each subfigure presents four columns corresponding to: clean baseline, anarchy symbol, pear logo, and rainbow flag scenarios. Within each subfigure, rows show: (top) generated images, (middle) prompts used for image generation, and (bottom) target text responses used in the linkage loss. Dataset Composition. Each scenario (branding and i…
Figure 14. The human evaluation platform setup, including the initial instructions and the active task.
Original abstract

Unified autoregressive models (UAMs) are transformer models that generate text as well as image tokens within a single autoregressive pass. Shared parameters and a multimodal vocabulary simplify the training pipeline and facilitate flexible multimodal generation, yet might introduce new vulnerabilities. In particular, we are the first to show that this unified architecture enables multimodal backdoor attacks, where a trigger can propagate malicious effects across multiple output modalities. Specifically, we present the Token by Token Backdoor Attack (ToBAC), the first backdoor attack targeting UAMs, exploring both data-based and model-based poisoning strategies. We demonstrate that innocuous characters or even common words can be transformed into triggers that elicit harmful behavior in autoregressive image generation. ToBAC can jointly manipulate visual outputs and accompanying text, increasing the perceived authenticity of fabricated content. With model access, ToBAC enables attacks on the unified Liquid model in which a subtle word (e.g., “cool”) induces modality-aligned brand promotion or ideological influence in 55% of generations. Without model access, ToBAC can be induced through data poisoning, achieving an average success rate of 63.1% against JanusPro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Token by Token Backdoor Attack (ToBAC) on unified autoregressive models (UAMs), which generate text and image tokens in a single autoregressive pass using shared parameters and a multimodal vocabulary. It claims this architecture enables multimodal backdoors where a single innocuous trigger (e.g., the word 'cool') propagates malicious effects across both text and image outputs. The work demonstrates model-access attacks on Liquid achieving 55% success in inducing brand promotion or ideological influence, and data-poisoning attacks on JanusPro with 63.1% average success, showing joint manipulation of visual and textual content.

Significance. If the empirical results hold under proper controls, the paper would be significant for identifying a new class of vulnerabilities in emerging UAM architectures that simplify multimodal training but may amplify backdoor propagation. The concrete success rates in both model-access and data-poisoning settings provide falsifiable evidence of practical attack feasibility, and the focus on cross-modal consistency in fabricated outputs highlights a security risk not addressed in prior separate-modality backdoor literature.

major comments (2)
  1. [Abstract / Experiments] Abstract and experimental results: The central claim that the unified autoregressive architecture (shared parameters and single token stream) enables reliable cross-modal trigger propagation is not supported by any ablation or control experiment. No comparison is presented to an otherwise identical multimodal model with separate text and image autoregressive contexts or vocabularies, leaving open whether the 55% (Liquid) and 63.1% (JanusPro) rates arise from unification itself or from standard data-poisoning effects that could occur in non-unified pipelines.
  2. [Abstract] Abstract: Concrete success rates of 55% and 63.1% are reported, yet the abstract (and by extension the experimental description) provides no details on the number of generations evaluated, statistical significance, baseline comparisons, trigger selection criteria, or controls for confounding factors such as model scale or training data overlap. This absence prevents verification that the observed modality-aligned malicious outputs are attributable to the claimed ToBAC mechanism.
minor comments (1)
  1. [Abstract] The abstract mentions specific triggers such as the word 'cool' but does not define the full trigger set or poisoning ratio used in the data-based strategy; adding this would improve reproducibility.
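
Major comment 2 turns on how many generations stand behind each reported rate. As a rough illustration of why that matters, the sketch below computes a Wilson score interval for an observed success rate. The sample sizes used in the usage note are hypothetical; the abstract does not state them.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical sample sizes, both with an observed rate of 55%.
small = wilson_interval(11, 20)
large = wilson_interval(110, 200)
```

With the same observed 55%, the interval for 20 generations is much wider than for 200, which is why the referee asks for the evaluation count alongside the headline rate.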

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experimental results: The central claim that the unified autoregressive architecture (shared parameters and single token stream) enables reliable cross-modal trigger propagation is not supported by any ablation or control experiment. No comparison is presented to an otherwise identical multimodal model with separate text and image autoregressive contexts or vocabularies, leaving open whether the 55% (Liquid) and 63.1% (JanusPro) rates arise from unification itself or from standard data-poisoning effects that could occur in non-unified pipelines.

    Authors: We agree that a direct ablation comparing unified and non-unified architectures would provide stronger causal evidence for the role of unification. Constructing an otherwise identical non-unified model requires fundamental changes to the tokenization, context handling, and training pipeline, making a controlled comparison computationally prohibitive at the scale of Liquid and JanusPro. In the revised manuscript we have added a dedicated discussion subsection that contrasts the shared-parameter, interleaved-token design of UAMs with prior separate-modality backdoor attacks, highlighting architectural differences that enable joint cross-modal manipulation. We have also softened the abstract and introduction claims from “enables” to “facilitates” and included additional qualitative analysis of trigger propagation patterns that are difficult to replicate in non-unified settings.

    Revision: partial

  2. Referee: [Abstract] Abstract: Concrete success rates of 55% and 63.1% are reported, yet the abstract (and by extension the experimental description) provides no details on the number of generations evaluated, statistical significance, baseline comparisons, trigger selection criteria, or controls for confounding factors such as model scale or training data overlap. This absence prevents verification that the observed modality-aligned malicious outputs are attributable to the claimed ToBAC mechanism.

    Authors: We thank the referee for this observation. The revised manuscript now includes these details in both the abstract and the experimental section: success rates are computed over 200 generations per trigger (with standard deviation reported), statistical significance is assessed via binomial tests against clean-model baselines (p < 0.01), trigger selection criteria are described (common words/characters with no prior malicious association in the training corpus), and controls for model scale and data overlap are added via evaluation on multiple model sizes and explicit checks for trigger contamination. A new summary table of experimental parameters has been inserted.

    Revision: yes
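
The binomial test mentioned in the second response could look roughly like the following. This is a sketch under stated assumptions, not the authors' procedure: it computes an exact one-sided upper-tail p-value, and both the clean-model baseline rate `p0` and the count of 110 successes (55% of 200) are illustrative values rather than figures reported by the paper.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p0): a one-sided p-value."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a probability")
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Illustrative numbers only: 200 generations, 110 attack successes,
# and a hypothetical 5% rate at which a clean model shows the target concept.
p_value = binomial_upper_tail(110, 200, 0.05)
significant = p_value < 0.01
```

The result is sensitive to the chosen baseline: with a clean-model rate of exactly zero, any single success yields a p-value of zero, so the baseline estimate itself needs to be reported.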

standing simulated objections not resolved
  • A full empirical ablation requiring training of an equivalent non-unified multimodal model at the scale of the evaluated UAMs, due to prohibitive computational cost.

Circularity Check

0 steps flagged

No circularity: empirical attack results rest on independent experiments

Full rationale

This is an empirical security paper demonstrating backdoor attacks via data poisoning and model access on UAMs. The central results (55% success on Liquid, 63.1% average on JanusPro) are measured outcomes from concrete attack implementations, not derived from equations or parameters that reduce to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or skeptic analysis. The lack of an ablation control is a question of experimental strength, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard machine-learning security assumptions about how poisoning affects shared model parameters; no new free parameters, axioms, or invented entities are introduced.

axioms (1)
  • Domain assumption: Shared parameters across modalities allow a single trigger to influence multiple output types.
    This is the core premise enabling cross-modal propagation and is invoked in the description of ToBAC.

pith-pipeline@v0.9.0 · 5746 in / 1147 out tokens · 38656 ms · 2026-05-20T05:32:27.020102+00:00 · methodology


Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 11 internal anchors

  1. [1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.

  2. [2] Black Forest Labs. FLUX.2 [klein]: Towards Interactive Visual Intelligence. https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence, January 2026. Black Forest Labs blog post, January 15, 2026, accessed April 14, 2026.

  3. [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

  4. [4] Xiangrui Cai, Haidong Xu, Sihan Xu, Ying Zhang, et al. Badprompt: Backdoor attacks on continuous prompts. Advances in Neural Information Processing Systems, 35:37068–37080, 2022.

  5. [5] Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP), pages 407–425. IEEE, 2024.

  6. [6] David M Chan, Rodolfo Corona, Joonyong Park, Cheol Jun Cho, Yutong Bai, and Trevor Darrell. Analyzing the language of visual tokens. arXiv preprint arXiv:2411.05001, 2024.

  7. [7] Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip Krishnan. Muse: Text-to-image generation via masked generative transformers. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings o...

  8. [8] PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/chang23b.html

  9. [9] Weixin Chen, Dawn Song, and Bo Li. Trojdiff: Trojan attacks on diffusion models with diverse targets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4035–4044, 2023.

  10. [10] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.

  11. [11] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.

  12. [12] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion models? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4015–4024, 2023.

  13. [13] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. Villandiffusion: A unified backdoor attack framework for diffusion models. Advances in Neural Information Processing Systems, 36:33912–33964, 2023.

  14. [14] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  15. [15] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.

  16. [16] Feng-Lei Fan, Jinjun Xiong, Mengzhou Li, and Ge Wang. On interpretability of artificial neural networks: A survey. IEEE Transactions on Radiation and Plasma Medical Sciences, 5(6):741–760, 2021.

  17. [17] Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2426–2436, 2023.

  18. [18] Arun Ganesh, Mahdi Haghifam, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar, Abhradeep Guha Thakurta, and Lun Wang. Why is public pretraining necessary for private model training? In International Conference on Machine Learning, pages 10611–10627. PMLR, 2023.

  19. [19] Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.

  20. [20] Jonas Henry Grebe, Tobias Braun, Marcus Rohrbach, and Anna Rohrbach. Erased but not forgotten: How backdoors compromise concept erasure. arXiv preprint arXiv:2504.21072, 2025.

  21. [21] Rainer Greifeneder, Mariela Jaffe, Eryn Newman, and Norbert Schwarz. The Psychology of Fake News: Accepting, Sharing, and Correcting Misinformation. Routledge, 1 edition, 2021. ISBN 978-0-429-29537-9. doi: 10.4324/9780429295379

  22. [22] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling, 2024.

  23. [23] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

  24. [24] Yuning Han, Bingyin Zhao, Rui Chu, Feng Luo, Biplab Sikdar, and Yingjie Lao. Uibdiffusion: Universal imperceptible backdoor attack for diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19186–19196, 2025.

  25. [25] Jiang Hao, Xiao Jin, Hu Xiaoguang, Chen Tianyou, and Zhao Jiajia. Diff-cleanse: Identifying and mitigating backdoor attacks in diffusion models. arXiv preprint arXiv:2407.21316, 2024.

  26. [26] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

  27. [27] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  28. [28] Bowen Hu and Chip-Hong Chang. Diffense: defense against backdoor attacks on deep neural networks with latent diffusion. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 14(4):729–742, 2024.

  29. [29] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  30. [30] Sangwon Jang, June Suk Choi, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Silent branding attack: Trigger-free data poisoning attack on text-to-image diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8203–8212, 2025.

  31. [31] Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22691–22702, 2023.

  32. [32] Byung Hyun Lee, Sungjin Lim, and Se Young Chun. Localized concept erasure for text-to-image diffusion models using training-free gated low-rank adaptation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18596–18606, 2025.

  33. [33] Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024.

  34. [34] Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, et al. Synergen-vl: Towards synergistic image understanding and generation with vision experts and token folding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29767–29779, 2025.

  35. [35] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=xozJw0kZXF

  36. [36] Yi-Shan Lin, Wen-Chuan Lee, and Z Berkay Celik. What do you see? evaluation of explainable artificial intelligence (xai) interpretability through neural backdoors. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pages 1027–1035, 2021.

  37. [37] Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. In The Thirteenth International Conference on Learning Representations,

  38. [38] URL https://openreview.net/forum?id=HN8V0flwJF

  39. [39]

    Trojaning attack on neural networks

    Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In25th Annual Network And Distributed System Security Symposium (NDSS 2018). Internet Soc, 2018

  40. [40]

    Tuna: Taming unified visual representations for native unified multimodal models

    Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhao- chong An, Fanny Yang, Aditya Patel, et al. Tuna: Taming unified visual representations for native unified multimodal models.arXiv preprint arXiv:2512.02014, 2025

  41. [41]

    Backdooring vision-language models with out-of-distribution data

    Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, and Chao Chen. Backdooring vision-language models with out-of-distribution data. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=tZozeR3VV7

  42. [42]

    Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

    Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Henghui Ding, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, and Yu-Gang Jiang. Safety at scale: A comprehensive survey of large model safety.arXiv preprint arxiv.2502.05206, 02 2025. doi...

  43. [43]

    Token-shuffle: Towards high-resolution image generation with autoregressive models

    Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, et al. Token-shuffle: Towards high-resolution image generation with autoregressive models. arXiv preprint arXiv:2504.17789, 2025

  44. [44]

    Llama prompt guard 2 model card

    Meta Llama. Llama prompt guard 2 model card. https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-22M, 2025. Hugging Face model card, accessed 2026-04-19

  45. [45]

    TERD: A unified framework for safeguarding diffusion models against backdoors

    Yichuan Mo, Hui Huang, Mingjie Li, Ang Li, and Yisen Wang. TERD: A unified framework for safeguarding diffusion models against backdoors. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 35892–35909. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/mo24a.html

  46. [46]

    Understanding the gains from repeated self-distillation

    Divyansh Pareek, Simon S Du, and Sewoong Oh. Understanding the gains from repeated self-distillation. Advances in Neural Information Processing Systems, 37:7759–7796, 2024

  47. [47]

    Llm guard: Secure your llm applications

    Protect AI. Llm guard: Secure your llm applications. https://protectai.com/llm-guard, 2026. Accessed: 2026-04-19

  48. [48]

    Onion: A simple and effective defense against textual backdoor attacks

    Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Onion: A simple and effective defense against textual backdoor attacks. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 9558–9566, 2021

  49. [49]

    Hate in plain sight: On the risks of moderating ai-generated hateful illusions

    Yiting Qu, Ziqing Yang, Yihan Ma, Michael Backes, Savvas Zannettou, and Yang Zhang. Hate in plain sight: On the risks of moderating ai-generated hateful illusions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19617–19627, 2025

  50. [50]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  51. [51]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021

  52. [52]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  53. [53]

    Stable Diffusion 2.0 Release

    Robin Rombach. Stable Diffusion 2.0 Release. Stability AI, November 2022. URL https://stability.ai/news/stable-diffusion-v2-release. Accessed: 2025-02-09

  54. [54]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  55. [55]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  56. [56]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  57. [57]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

  58. [58]

    Constrained optimization with dynamic bound-scaling for effective nlp backdoor defense

    Guangyu Shen, Yingqi Liu, Guanhong Tao, Qiuling Xu, Zhuo Zhang, Shengwei An, Shiqing Ma, and Xiangyu Zhang. Constrained optimization with dynamic bound-scaling for effective nlp backdoor defense. In International Conference on Machine Learning, pages 19879–19892. PMLR, 2022

  59. [59]

    UnIVAL: Unified model for image, video, audio and language tasks

    Mustafa Shukor, Corentin Dancette, Alexandre Rame, and Matthieu Cord. UnIVAL: Unified model for image, video, audio and language tasks. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=4uflhObpcp

  60. [60]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

  61. [61]

    Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis

    Lukas Struppek, Dominik Hintersdorf, and Kristian Kersting. Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4584–4596, 2023

  62. [62]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  63. [63]

    Sequence to sequence learning with neural networks

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014

  64. [64]

    Ugen: Unified autoregressive multimodal model with progressive vocabulary learning

    Hongxuan Tang, Hao Liu, and Xinyan Xiao. Ugen: Unified autoregressive multimodal model with progressive vocabulary learning. arXiv preprint arXiv:2503.21193, 2025

  65. [65]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  66. [66]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, T Mesnard, C Hardin, R Dadashi, S Bhupatiraju, S Pathak, L Sifre, M Rivière, MS Kale, J Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  67. [67]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024

  68. [68]

    Metamorph: Multimodal understanding and generation via instruction tuning

    Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17001–17012, 2025

  69. [69]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  70. [70]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  71. [71]

    Eviledit: Backdooring text-to-image diffusion models in one second

    Hao Wang, Shangwei Guo, Jialing He, Kangjie Chen, Shudong Zhang, Tianwei Zhang, and Tao Xiang. Eviledit: Backdooring text-to-image diffusion models in one second. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3657–3665, 2024

  72. [72]

    Parallel sequence modeling via generalized spatial propagation network

    Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, and Sifei Liu. Parallel sequence modeling via generalized spatial propagation network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4473–4483, 2025

  73. [73]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  74. [74]

    T2ishield: Defending against backdoors on text-to-image diffusion models

    Zhongqi Wang, Jie Zhang, Shiguang Shan, and Xilin Chen. T2ishield: Defending against backdoors on text-to-image diffusion models. In European Conference on Computer Vision, pages 107–124. Springer, 2024

  75. [75]

    DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models

    Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  76. [76]

    Backdoor attacks against deep learning systems in the physical world

    Emily Wenger, Josephine Passananti, Arjun Nitin Bhagoji, Yuanshun Yao, Haitao Zheng, and Ben Y Zhao. Backdoor attacks against deep learning systems in the physical world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6206–6215, 2021

  77. [77]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  78. [78]

    Liquid: Language models are scalable and unified multi-modal generators

    Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, and Xiang Bai. Liquid: Language models are scalable and unified multi-modal generators. International Journal of Computer Vision, 2025

  79. [79]

    Harmonizing visual representations for unified multimodal understanding and generation

    Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. Harmonizing visual representations for unified multimodal understanding and generation. arXiv preprint arXiv:2503.21979, 2025

  80. [80]

    Show-o: One single transformer to unify multimodal understanding and generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=o6Ynz6OIQ6

Showing first 80 references.