Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
Pith reviewed 2026-05-20 05:32 UTC · model grok-4.3
The pith
Unified autoregressive models allow a single backdoor trigger to corrupt both text and image outputs together.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unified autoregressive models enable multimodal backdoor attacks in which a single trigger propagates malicious effects across text and image generation. The Token by Token Backdoor Attack (ToBAC) achieves this by turning innocuous inputs into triggers that jointly manipulate visual outputs and accompanying text, with success rates of 55 percent under model access on the Liquid model and an average of 63.1 percent via data poisoning on JanusPro.
What carries the argument
The Token by Token Backdoor Attack (ToBAC), which exploits the shared transformer parameters and combined multimodal vocabulary to embed triggers that affect the entire autoregressive output sequence across modalities.
If this is right
- A single trigger can jointly alter visual and textual outputs to increase the apparent authenticity of generated content.
- Backdoors can be installed without model access by poisoning training data alone.
- Everyday words or characters can be turned into reliable triggers for brand promotion or ideological shifts in generated material.
Where Pith is reading between the lines
- Similar shared-parameter designs in other multimodal systems could inherit the same cross-modality trigger propagation risk.
- Detection methods might focus on checking whether specific token sequences produce statistically unusual alignment between text and image outputs.
- Splitting parameter sets or vocabularies by modality could reduce the attack surface even if it increases training cost.
Load-bearing premise
Shared parameters across text and image token generation let one poisoned trigger reliably change outputs in both modalities without separate attacks for each.
What would settle it
A controlled test on a unified model where the same trigger changes only text outputs or only image outputs but never both in the same generation pass.
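One way to make this test concrete is to label every triggered generation on two axes (did the text change, did the image change) and tally the four possible outcomes. The sketch below is a minimal illustration, not the paper's evaluation code: the function names and the example labels are hypothetical, and how a "hit" is judged in each modality is left to the evaluator.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tally_modalities(outcomes: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Count (text_hit, image_hit) pairs over triggered generations."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no generations supplied")
    return {
        "both": counts[(True, True)],
        "text_only": counts[(True, False)],
        "image_only": counts[(False, True)],
        "neither": counts[(False, False)],
        "joint_rate": counts[(True, True)] / total,
    }


def never_joint(summary: Dict[str, float]) -> bool:
    """True when the trigger fires in single modalities but never in both at once."""
    single = summary["text_only"] + summary["image_only"]
    return summary["both"] == 0 and single > 0


# Hypothetical labels for 10 triggered generations (illustrative only).
labels = (
    [(True, True)] * 6
    + [(True, False)] * 2
    + [(False, True)] * 1
    + [(False, False)] * 1
)
summary = tally_modalities(labels)
print(summary, never_joint(summary))
```

A result where `never_joint` returns `True` over a large sample would count against the load-bearing premise; a high `joint_rate` would be consistent with it, though it would not by itself show that unification is the cause.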
Original abstract
Unified autoregressive models (UAMs) are transformer models that generate text as well as image tokens within a single autoregressive pass. Shared parameters and a multimodal vocabulary simplify the training pipeline and facilitate flexible multimodal generation, yet might introduce new vulnerabilities. In particular, we are the first to show that this unified architecture enables multimodal backdoor attacks, where a trigger can propagate malicious effects across multiple output modalities. Specifically, we present the Token by Token Backdoor Attack (ToBAC), the first backdoor attack targeting UAMs, exploring both data-based and model-based poisoning strategies. We demonstrate that innocuous characters or even common words can be transformed into triggers that elicit harmful behavior in autoregressive image generation. ToBAC can jointly manipulate visual outputs and accompanying text, increasing the perceived authenticity of fabricated content. With model access, ToBAC enables attacks on the unified Liquid model in which a subtle word (e.g., "cool") induces modality-aligned brand promotion or ideological influence in 55% of generations. Without model access, ToBAC can be induced through data poisoning, achieving an average success rate of 63.1% against JanusPro.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Token by Token Backdoor Attack (ToBAC) on unified autoregressive models (UAMs), which generate text and image tokens in a single autoregressive pass using shared parameters and a multimodal vocabulary. It claims this architecture enables multimodal backdoors where a single innocuous trigger (e.g., the word 'cool') propagates malicious effects across both text and image outputs. The work demonstrates model-access attacks on Liquid achieving 55% success in inducing brand promotion or ideological influence, and data-poisoning attacks on JanusPro with 63.1% average success, showing joint manipulation of visual and textual content.
Significance. If the empirical results hold under proper controls, the paper would be significant for identifying a new class of vulnerabilities in emerging UAM architectures that simplify multimodal training but may amplify backdoor propagation. The concrete success rates in both model-access and data-poisoning settings provide falsifiable evidence of practical attack feasibility, and the focus on cross-modal consistency in fabricated outputs highlights a security risk not addressed in prior separate-modality backdoor literature.
major comments (2)
- [Abstract / Experiments] Abstract and experimental results: The central claim that the unified autoregressive architecture (shared parameters and single token stream) enables reliable cross-modal trigger propagation is not supported by any ablation or control experiment. No comparison is presented to an otherwise identical multimodal model with separate text and image autoregressive contexts or vocabularies, leaving open whether the 55% (Liquid) and 63.1% (JanusPro) rates arise from unification itself or from standard data-poisoning effects that could occur in non-unified pipelines.
- [Abstract] Abstract: Concrete success rates of 55% and 63.1% are reported, yet the abstract (and by extension the experimental description) provides no details on the number of generations evaluated, statistical significance, baseline comparisons, trigger selection criteria, or controls for confounding factors such as model scale or training data overlap. This absence prevents verification that the observed modality-aligned malicious outputs are attributable to the claimed ToBAC mechanism.
minor comments (1)
- [Abstract] The abstract mentions specific triggers such as the word 'cool' but does not define the full trigger set or poisoning ratio used in the data-based strategy; adding this would improve reproducibility.
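The second major comment turns on how many generations stand behind each reported rate. As a rough illustration of why that number matters, the sketch below computes a 95% Wilson score interval for an observed 55% success rate at several sample sizes. The sample sizes are assumptions chosen for illustration; the abstract does not state them.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative sample sizes only; none of these is taken from the abstract.
for n in (20, 200, 2000):
    low, high = wilson_interval(round(0.55 * n), n)
    print(f"n={n:5d}  55% observed  95% CI: [{low:.3f}, {high:.3f}]")
```

The interval narrows as `n` grows, which is why a bare percentage without a sample size is hard to verify.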
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions made to the manuscript.
- Referee: [Abstract / Experiments] Abstract and experimental results: The central claim that the unified autoregressive architecture (shared parameters and single token stream) enables reliable cross-modal trigger propagation is not supported by any ablation or control experiment. No comparison is presented to an otherwise identical multimodal model with separate text and image autoregressive contexts or vocabularies, leaving open whether the 55% (Liquid) and 63.1% (JanusPro) rates arise from unification itself or from standard data-poisoning effects that could occur in non-unified pipelines.
  Authors: We agree that a direct ablation comparing unified and non-unified architectures would provide stronger causal evidence for the role of unification. Constructing an otherwise identical non-unified model requires fundamental changes to the tokenization, context handling, and training pipeline, making a controlled comparison computationally prohibitive at the scale of Liquid and JanusPro. In the revised manuscript we have added a dedicated discussion subsection that contrasts the shared-parameter, interleaved-token design of UAMs with prior separate-modality backdoor attacks, highlighting architectural differences that enable joint cross-modal manipulation. We have also softened the abstract and introduction claims from “enables” to “facilitates” and included additional qualitative analysis of trigger propagation patterns that are difficult to replicate in non-unified settings.
  Revision: partial
- Referee: [Abstract] Abstract: Concrete success rates of 55% and 63.1% are reported, yet the abstract (and by extension the experimental description) provides no details on the number of generations evaluated, statistical significance, baseline comparisons, trigger selection criteria, or controls for confounding factors such as model scale or training data overlap. This absence prevents verification that the observed modality-aligned malicious outputs are attributable to the claimed ToBAC mechanism.
  Authors: We thank the referee for this observation. The revised manuscript now includes these details in both the abstract and the experimental section: success rates are computed over 200 generations per trigger (with standard deviation reported), statistical significance is assessed via binomial tests against clean-model baselines (p < 0.01), trigger selection criteria are described (common words/characters with no prior malicious association in the training corpus), and controls for model scale and data overlap are added via evaluation on multiple model sizes and explicit checks for trigger contamination. A new summary table of experimental parameters has been inserted.
  Revision: yes
- Not addressed in the revision: a full empirical ablation, which would require training an equivalent non-unified multimodal model at the scale of the evaluated UAMs; the authors cite prohibitive computational cost.
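The rebuttal describes binomial tests against clean-model baselines over 200 generations per trigger. A minimal sketch of such a test, using only the standard library, is given below. The baseline rate of 0.05 is a hypothetical placeholder (the rebuttal does not report one), and pairing the 55% Liquid figure with 200 generations is an illustrative assumption rather than a reported configuration.

```python
import math


def binomial_tail(successes: int, n: int, p0: float) -> float:
    """Exact one-sided p-value: P(X >= successes) for X ~ Binomial(n, p0)."""
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must lie in [0, 1]")
    return sum(
        math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


n = 200          # generations per trigger, as stated in the rebuttal
observed = 110   # 55% of 200 (illustrative pairing, see lead-in)
baseline = 0.05  # HYPOTHETICAL clean-model rate; not reported in the source

p_value = binomial_tail(observed, n, baseline)
print(f"one-sided p-value: {p_value:.3e}")
print("significant at 0.01" if p_value < 0.01 else "not significant at 0.01")
```

The conclusion depends heavily on the baseline: if a clean model already produces the target content often, the same observed count carries far less evidence, which is why the referee asks for the baseline to be reported.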
Circularity Check
No circularity: empirical attack results rest on independent experiments
This is an empirical security paper demonstrating backdoor attacks via data poisoning and model access on UAMs. The central results (55% success on Liquid, 63.1% average on JanusPro) are measured outcomes from concrete attack implementations, not derived from equations or parameters that reduce to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or skeptic analysis. The lack of an ablation control is a question of experimental strength, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: shared parameters across modalities allow a single trigger to influence multiple output types.