pith. machine review for the scientific record.

arxiv: 2604.10837 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords image immunization · image-to-video generation · adversarial defense · deepfake prevention · latent divergence · video synthesis · encoder alignment

The pith

Immune2V protects images from video deepfakes by balancing noise at the encoder and aligning generation toward collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard adversarial perturbations on images fail to disrupt image-to-video models because the video encoding process dilutes the noise across future frames and text conditioning steers the output away from the intended effect. It introduces Immune2V to enforce consistent adversarial signal through temporally balanced latent divergence at the encoder and to steer intermediate representations along a precomputed path that induces generation collapse. A sympathetic reader would care because this targets the specific robustness of modern video generators, offering a way to safeguard photos against unauthorized animation into realistic moving content. The experiments show the new method yields stronger and longer-lasting degradation than direct adaptations of image defenses while keeping changes to the input imperceptible.

Core claim

Modern I2V models resist naive image-level adversarial attacks because video encoding rapidly dilutes the adversarial noise across future frames and continuous text-conditioned guidance overrides the disruptive intent. Immune2V addresses this by enforcing temporally balanced latent divergence at the encoder level to prevent signal dilution and by aligning intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override, producing substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.

What carries the argument

Temporally balanced latent divergence at the encoder level together with alignment to a precomputed collapse-inducing trajectory, which maintains adversarial signal across time steps and steers the generation process away from coherent output.
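
To make that mechanism concrete, the sketch below shows what a two-part immunization objective of this kind could look like: a PGD-style loop that maximizes a per-time-step balanced divergence in the video encoder's latents while pulling the generator's intermediate features toward a precomputed collapse trajectory. The function names (`video_encoder`, `interm_reprs`, `collapse_traj`), the inverse-magnitude balancing, the L-infinity budget, and the weight `lam` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def temporally_balanced_divergence(z_adv: torch.Tensor, z_clean: torch.Tensor) -> torch.Tensor:
    """Divergence between perturbed and clean video-encoder latents, shape (T, C, H, W).

    Each time step is re-weighted by the inverse of its detached magnitude, so that
    late, heavily diluted segments contribute gradient comparable to early ones.
    """
    per_step = F.mse_loss(z_adv, z_clean, reduction="none").mean(dim=(1, 2, 3))  # (T,)
    weights = 1.0 / (per_step.detach() + 1e-8)
    weights = weights / weights.sum()
    return (weights * per_step).sum()


def trajectory_alignment(h_adv, h_collapse) -> torch.Tensor:
    """Distance between intermediate generative features and a precomputed
    collapse-inducing trajectory (one target tensor per recorded step)."""
    return torch.stack([F.mse_loss(h, t) for h, t in zip(h_adv, h_collapse)]).mean()


def immunize(image, video_encoder, interm_reprs, collapse_traj,
             eps=8 / 255, step=1 / 255, iters=100, lam=1.0):
    """PGD-style optimization of an imperceptible perturbation under an L-infinity
    budget `eps`. `video_encoder(x)` returns per-frame latents (T, C, H, W);
    `interm_reprs(x)` returns a list of intermediate generator features."""
    delta = torch.zeros_like(image, requires_grad=True)
    z_clean = video_encoder(image).detach()
    for _ in range(iters):
        x_adv = image + delta
        # Maximize encoder-latent divergence, minimize distance to the collapse path.
        loss = (-temporally_balanced_divergence(video_encoder(x_adv), z_clean)
                + lam * trajectory_alignment(interm_reprs(x_adv), collapse_traj))
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()               # descend the combined loss
            delta.clamp_(-eps, eps)                          # imperceptibility budget
            delta.add_(image).clamp_(0.0, 1.0).sub_(image)   # keep pixels in valid range
        delta.grad = None
    return (image + delta).detach()
```

The two terms mirror the paper's two streams: the balanced divergence keeps the adversarial signal alive across temporal segments, and the alignment term fights the semantic guidance that would otherwise steer generation back toward a coherent video.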

If this is right

  • Videos generated from immunized images exhibit stronger and longer-lasting visual degradation than those from images protected by adapted static methods.
  • The protection remains effective while the changes to the original image stay imperceptible to viewers.
  • Defenses against video synthesis must operate inside the model's temporal and conditional mechanisms rather than only at the input image level.
  • The encoder balancing and trajectory alignment can be used as a starting point for protecting against other forms of conditional video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same balancing principle could be tested on multi-frame tasks such as animation or 3D lifting where signal dilution across outputs is also likely.
  • Understanding the internal encoder dynamics of a generator appears necessary for robust immunization, pointing toward architecture-specific rather than generic input perturbations.
  • Real-world deployment would require checking performance on user-chosen text prompts and commercial I2V services not studied in the paper.

Load-bearing premise

That noise dilution in video encoding and override by text guidance are the dominant reasons image attacks fail, and that encoder-level balancing plus trajectory alignment will work across different I2V architectures and prompts.

What would settle it

Applying Immune2V to an I2V model not used in the original experiments and measuring whether the resulting videos still exhibit substantially stronger and more persistent degradation than image-level baselines under identical imperceptibility constraints.
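
One way to operationalize that test is sketched below, under assumptions not taken from the paper: plain per-frame MSE stands in for whatever degradation metric the original experiments use, and `generate`, `immunized_img`, `baseline_img`, and `clean_img` are hypothetical placeholders for the held-out I2V model and its inputs, with both perturbations constrained to the same L-infinity budget.

```python
import numpy as np


def per_frame_degradation(video_attacked: np.ndarray, video_clean: np.ndarray) -> np.ndarray:
    """Per-frame mean squared error between the video generated from a protected image
    and the video generated from the clean image. Shapes: (T, H, W, C) in [0, 1]."""
    return ((video_attacked - video_clean) ** 2).mean(axis=(1, 2, 3))


def persistence_summary(deg: np.ndarray, tail_frac: float = 0.25) -> dict:
    """Strength and persistence of degradation: overall mean, mean over the last
    `tail_frac` of frames, and their ratio (values near or above 1 indicate the
    effect does not fade as generation progresses)."""
    tail = deg[int(len(deg) * (1 - tail_frac)):]
    return {
        "mean": float(deg.mean()),
        "tail_mean": float(tail.mean()),
        "persistence_ratio": float(tail.mean() / (deg.mean() + 1e-8)),
    }


# Hypothetical usage on a held-out I2V model:
# deg_immune = per_frame_degradation(generate(immunized_img), generate(clean_img))
# deg_base   = per_frame_degradation(generate(baseline_img), generate(clean_img))
# print(persistence_summary(deg_immune), persistence_summary(deg_base))
```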

Figures

Figures reproduced from arXiv: 2604.10837 by Haotian Xue, James M. Rehg, Ozgur Kara, Yongxin Chen, Zeqian Long.

Figure 1
Figure 1. Immune2V Protection. Immune2V protects images from Image-to-Video (I2V) generation by adding imperceptible immunization noise to produce an immunized image. When processed by an I2V generator, a clean image yields realistic motion (e.g., the human face articulates naturally and the train moves circularly along the track, top row), whereas the immunized image disrupts the generation dynamics, producing hig… view at source ↗
Figure 2
Figure 2. Immunization on Dual-Stream I2V Architecture. A clean input image guides the generation process through i) a spatial-temporal stream, where the image is processed through a video encoder to initialize the structural latent space, and ii) a semantic stream, where an image encoder extracts high-level embeddings for continuous semantic guidance. While these streams normally generate a realistic video condit… view at source ↗
Figure 3
Figure 3. Temporal Attenuation. Standard image-level attacks fail because their influence vanishes rapidly across the zero-padded temporal segments during both forward and backward propagation. Our temporally-balanced approach overcomes this by actively enforcing persistent corruption across the entire temporal axis. Due to memory constraints, we compute the gradient for frames 0–12. Subsequent frames of the generat… view at source ↗
Figure 4
Figure 4. Semantic Conditioning Override. We analyze generative trajectories in the noise latent space by measuring semantic distance to a clean baseline (Clean Image (Good Prompt)) at each timestep. (a) A mismatched prompt (Clean Image (Bad Prompt)) naturally diverges from the baseline (positive slope). (b) An encoder-only spatial attack (Enc.-Imm. Image (Good Prompt)) is overridden by the continuous semantic guid… view at source ↗
Figure 5
Figure 5. Immune2V Framework. Our method simultaneously targets the spatial-temporal and semantic streams to ensure persistent disruption. The Spatial-Temporal Attack employs a balanced encoder loss and dense targets to recover vanishing optimization signals across temporal segments. The Semantic Attack hijacks DiT guidance by forcing intermediate representations to mimic a precomputed collapse trajectory, neutraliz… view at source ↗
Figure 6
Figure 6. Qualitative Comparisons. A qualitative comparison across baselines, showing the input frame and subsequent video frames. See the Supplementary for full videos. 4.1.2 Baselines Since no prior work directly addresses I2V image immunization (the closest method, I2VGuard [43], is closed-source), we evaluate against three baselines. Clean Input uses the unperturbed image as the quality upper bound. Random Noise… view at source ↗
Figure 7
Figure 7. Additional Qualitative Results. Sampled frames from the generated videos are shown on the left, with zoomed comparisons against the corresponding clean video frames on the right. Full videos are available in Supplementary. 4.1.4 VLM-as-Judge To complement the automated metrics, we employ Gemini 3.1 Pro [55] as a pairwise judge. For each scene, the VLM is presented with our generated video alongside a basel… view at source ↗
Figure 8
Figure 8. Prompt templates used to generate good and bad prompts. view at source ↗
Figure 9
Figure 9. Prompt templates used by VLM Judge for automated pairwise evaluation of generated videos and first-frame image quality. view at source ↗
Figure 10
Figure 10. Clean and attacked video results on DynamiCrafter using Immune2V. view at source ↗
Figure 11
Figure 11. Clean and attacked video results on I2VGen-XL using Immune2V. view at source ↗
read the original abstract

Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Immune2V, a framework to immunize static images against dual-stream image-to-video (I2V) generation. It analyzes two failure modes of naive adversarial attacks—rapid dilution of noise during video encoding and override by continuous text-conditioned guidance—and proposes temporally balanced latent divergence at the encoder level plus alignment of intermediate representations to a precomputed collapse-inducing trajectory. The central claim is that this produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.

Significance. If the experimental claims hold with detailed quantitative support and cross-model validation, the work would be significant for extending adversarial immunization from static images to video generation, addressing deepfake risks. The systematic breakdown of I2V robustness mechanisms is a conceptual strength that could inform future defenses in generative models.

major comments (2)
  1. Abstract: the central claim that Immune2V 'produces substantially stronger and more persistent degradation than adapted image-level baselines' is presented without any quantitative metrics, error bars, specific improvement values (e.g., degradation scores or success rates), model details, or ablation results. This absence makes the empirical superiority impossible to assess and is load-bearing for the paper's main contribution.
  2. The method section (and associated experiments): the two proposed mechanisms—temporally balanced latent divergence and alignment to a precomputed collapse-inducing trajectory—are asserted to counteract dilution and text-guidance override, but no evidence is provided that the precomputed trajectory or balancing strategy transfers beyond the specific dual-stream I2V architectures and prompt distributions used for design. This directly affects the generalization required for the claim to hold outside the evaluated setting.
minor comments (1)
  1. Abstract: the term 'collapse-inducing trajectory' is introduced without a concise definition or reference to its computation; adding a brief parenthetical explanation would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing honest responses based on the manuscript content. Revisions have been made where the comments identify clear gaps in presentation or support.

read point-by-point responses
  1. Referee: Abstract: the central claim that Immune2V 'produces substantially stronger and more persistent degradation than adapted image-level baselines' is presented without any quantitative metrics, error bars, specific improvement values (e.g., degradation scores or success rates), model details, or ablation results. This absence makes the empirical superiority impossible to assess and is load-bearing for the paper's main contribution.

    Authors: We agree that the abstract would be strengthened by including key quantitative results to support the central claim. The body of the manuscript contains these metrics (degradation scores, success rates, error bars, model specifics, and ablation outcomes), but they were not summarized in the abstract. We have revised the abstract to incorporate representative quantitative values and model details drawn directly from the experimental results, while keeping the abstract concise. revision: yes

  2. Referee: The method section (and associated experiments): the two proposed mechanisms—temporally balanced latent divergence and alignment to a precomputed collapse-inducing trajectory—are asserted to counteract dilution and text-guidance override, but no evidence is provided that the precomputed trajectory or balancing strategy transfers beyond the specific dual-stream I2V architectures and prompt distributions used for design. This directly affects the generalization required for the claim to hold outside the evaluated setting.

    Authors: We acknowledge that the primary evaluations focus on the dual-stream I2V architectures and prompt sets used during development. The manuscript does include ablations isolating the contribution of each mechanism to counteracting dilution and override within those settings. We have revised the method and experimental sections to more explicitly delineate the evaluated scope and added a limitations discussion on generalization. Broader cross-architecture validation beyond the tested models would require additional experiments not present in the current work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical construction without self-referential derivations or fitted predictions

full rationale

The paper's central contribution is an empirical framework (temporally balanced latent divergence plus trajectory alignment) motivated by observed failure modes in I2V models. No equations, fitted parameters, or 'predictions' are presented that reduce by construction to the inputs or to self-citations. The method is described as an engineering response to dilution and override effects rather than a first-principles derivation. Self-citations, if present, are not load-bearing for the core claims, and the work remains self-contained against external benchmarks. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract introduces one invented construct (the precomputed collapse-inducing trajectory) whose independent evidence is not shown. No explicit free parameters are named, but the method implicitly depends on choices for balancing weights and trajectory selection that are not detailed.

axioms (1)
  • domain assumption The video encoder's latent space allows additive adversarial perturbations to be propagated without immediate collapse.
    Invoked when stating that naive attacks are diluted; required for the balancing step to be meaningful.
invented entities (1)
  • collapse-inducing trajectory no independent evidence
    purpose: A precomputed path in intermediate generative representations that forces video output to degrade when aligned to it.
    Introduced to counteract text-guidance override; no external falsifiable prediction or measurement is provided in the abstract.

pith-pipeline@v0.9.0 · 5477 in / 1326 out tokens · 47152 ms · 2026-05-10T15:05:26.236367+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 16 canonical work pages · 5 internal anchors

  1. [1]

    Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

  4. [4]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024

  5. [5]

    Momina Masood, Mariam Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza, and Hafiz Malik. Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53(4):3974–4026, 2023

  6. [6]

    Wan: Open and advanced large-scale video generative models, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  7. [7]

    Deepfake detection: A systematic literature review.IEEE access, 10:25494–25513, 2022

    Md Shohel Rana, Mohammad Nur Nobi, Beddhu Murali, and Andrew H Sung. Deepfake detection: A systematic literature review.IEEE access, 10:25494–25513, 2022

  8. [8]

    Explaining and harnessing adversarial examples. 3rd International Conference on Learning Representations, 2015

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. 3rd International Conference on Learning Representations, 2015

  9. [9]

    Adversarial examples in the physical world

    Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. InArtificial intelligence safety and security, pages 99–112. Chapman and Hall/CRC, 2018

  10. [10]

    On the robustness of semantic segmentation models to adversarial attacks

    Anurag Arnab, Ondrej Miksik, and Philip HS Torr. On the robustness of semantic segmentation models to adversarial attacks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 888–897, 2018

  11. [11]

    Adversarial attacks against closed-source MLLMs via feature optimal alignment

    Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, and Yang Liu. Adversarial attacks against closed-source MLLMs via feature optimal alignment. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  12. [12]

    Adversarial example does good: Preventing painting imitation from diffusion models via adversarial examples

    Chumeng Liang, Xiaoyu Wu, Yang Hua, Jiaru Zhang, Yiming Xue, Tao Song, Zhengui Xue, Ruhui Ma, and Haibing Guan. Adversarial example does good: Preventing painting imitation from diffusion models via adversarial examples. InInternational Conference on Machine Learning, pages 20763–20786. PMLR, 2023

  13. [13]

    Raising the cost of malicious ai-powered image editing

    Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, and Aleksander Madry. Raising the cost of malicious ai-powered image editing. In International Conference on Machine Learning, pages 29894–29918. PMLR, 2023

  14. [14]

    Mist: Towards improved adversarial examples for diffusion models

    Chumeng Liang and Xiaoyu Wu. Mist: Towards improved adversarial examples for diffusion models.arXiv preprint arXiv:2305.12683, 2023

  15. [15]

    Toward effective protection against diffusion-based mimicry through score distillation

    Haotian Xue, Chumeng Liang, Xiaoyu Wu, and Yongxin Chen. Toward effective protection against diffusion-based mimicry through score distillation. In The Twelfth International Conference on Learning Representations, 2023

  16. [16]

    Distraction is all you need: Memory-efficient image immunization against diffusion-based image editing

    Ling Lo, Cheng Yu Yeo, Hong-Han Shuai, and Wen-Huang Cheng. Distraction is all you need: Memory-efficient image immunization against diffusion-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24462–24471, 2024

  17. [17]

    Metacloak: Preventing unauthorized subject-driven text-to-image diffusion-based synthesis via meta-learning

    Yixin Liu, Chenrui Fan, Yutong Dai, Xun Chen, Pan Zhou, and Lichao Sun. Metacloak: Preventing unauthorized subject-driven text-to-image diffusion-based synthesis via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24219–24228, 2024

  18. [18]

    Glaze: Protecting artists from style mimicry by Text-to-Image models

    Shawn Shan, Jenna Cryan, Emily Wenger, Haitao Zheng, Rana Hanocka, and Ben Y Zhao. Glaze: Protecting artists from style mimicry by Text-to-Image models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 2187–2204, 2023

  19. [19]

    Imperceptible protection against style imitation from diffusion models.IEEE Transactions on Multimedia, 2026

    Namhyuk Ahn, Wonhyuk Ahn, KiYoon Yoo, Daesik Kim, and Seung-Hun Nam. Imperceptible protection against style imitation from diffusion models.IEEE Transactions on Multimedia, 2026

  20. [20]

    Diffvax: Optimization-free image immunization against diffusion-based editing

    Tarik Can Ozden, Ozgur Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, Sandeep P. Chinchali, and James Matthew Rehg. Diffvax: Optimization-free image immunization against diffusion-based editing. In The Fourteenth International Conference on Learning Representations, 2026

  21. [21]

    Make-a-video: Text-to-video generation without text-video data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023

  22. [22]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  23. [23]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  24. [24]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2023

  25. [25]

    Goku: Flow based video generative foundation models

    Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, et al. Goku: Flow based video generative foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23516–23527, 2025

  26. [26]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong MU, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Dynamicrafter: Animating open-domain images with video diffusion priors

    Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. InEuropean Conference on Computer Vision, pages 399–

  28. [28]

    I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models

    Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023

  29. [29]

    Consisti2v: Enhancing visual consistency for image-to-video generation.Transactions on Machine Learning Research, 2024

    Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation.Transactions on Machine Learning Research, 2024

  30. [30]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation

    Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024

  31. [31]

    Animatezero: Video diffusion models are zero-shot image animators

    Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, and Jian Zhang. Animatezero: Video diffusion models are zero-shot image animators. arXiv preprint arXiv:2312.03793, 2023

  32. [32]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. InThe Twelfth International Conference on Learning Representations, 2024

  33. [33]

    Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024

    Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024

  34. [34]

    Follow-your-shape: Shape-aware image editing via trajectory-guided region control

    Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, and Yue Ma. Follow-your-shape: Shape-aware image editing via trajectory-guided region control. In The Fourteenth International Conference on Learning Representations, 2026

  35. [35]

    Controllable video generation: A survey

    Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Bingyuan Wang, Qinghe Wang, Xuanhua He, Hongfa Wang, et al. Controllable video generation: A survey. arXiv preprint arXiv:2507.16869, 2025

  36. [36]

    Follow-your-click: Open-domain regional image animation via motion prompts

    Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, et al. Follow-your-click: Open-domain regional image animation via motion prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6018–6026, 2025

  37. [37]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Represen...

  38. [38]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  39. [39]

    Kling AI.https://klingai.com, 2024

    Kling AI Team. Kling AI.https://klingai.com, 2024

  40. [40]

    Dream Machine.https://lumalabs.ai/dream-machine, 2024

    Luma AI Team. Dream Machine.https://lumalabs.ai/dream-machine, 2024

  41. [41]

    Decontext as defense: Safe image editing in diffusion transformers.arXiv preprint arXiv:2512.16625, 2025

    Linghui Shen, Mingyue Cui, and Xingyi Yang. Decontext as defense: Safe image editing in diffusion transformers.arXiv preprint arXiv:2512.16625, 2025

  42. [42]

    Pixel is a barrier: Diffusion models are more adversarially robust than we think.arXiv preprint arXiv:2404.13320, 2024

    Haotian Xue and Yongxin Chen. Pixel is a barrier: Diffusion models are more adversarially robust than we think.arXiv preprint arXiv:2404.13320, 2024

  43. [43]

    I2vguard: Safeguarding images against misuse in diffusion-based image-to-video models

    Dongnan Gui, Xun Guo, Wengang Zhou, and Yan Lu. I2vguard: Safeguarding images against misuse in diffusion-based image-to-video models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12595–12604, 2025

  44. [44]

    Prime: Protect your videos from malicious editing.arXiv preprint arXiv:2402.01239, 2024

    Guanlin Li, Shuai Yang, Jie Zhang, and Tianwei Zhang. Prime: Protect your videos from malicious editing.arXiv preprint arXiv:2402.01239, 2024

  45. [45]

    T2vattack: Adversarial attack on text-to-video diffusion models.arXiv preprint arXiv:2512.23953, 2025

    Changzhen Li, Yuecong Min, Jie Zhang, Zheng Yuan, Shiguang Shan, and Xilin Chen. T2vattack: Adversarial attack on text-to-video diffusion models.arXiv preprint arXiv:2512.23953, 2025

  46. [46]

    Diffusion policy attacker: Crafting adversarial attacks for diffusion-based policies. Advances in Neural Information Processing Systems, 37:119614–119637, 2024

    Yipu Chen, Haotian Xue, and Yongxin Chen. Diffusion policy attacker: Crafting adversarial attacks for diffusion-based policies. Advances in Neural Information Processing Systems, 37:119614–119637, 2024

  47. [47]

    How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies

    Akansha Kalra, Basavasagar Patil, Guanhong Tao, and Daniel S Brown. How vulnerable is my learned policy? universal adversarial perturbation attacks on modern behavior cloning policies. arXiv preprint arXiv:2502.03698, 2025

  48. [48]

    Diffusionguard: A robust defense against malicious diffusion-based image editing

    June Suk Choi, Kyungmin Lee, Jongheon Jeong, Saining Xie, Jinwoo Shin, and Kimin Lee. Diffusionguard: A robust defense against malicious diffusion-based image editing. InThe Thirteenth International Conference on Learning Representations, 2025

  49. [49]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  50. [50]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018

  51. [51]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016

  52. [52]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

  53. [53]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  54. [54]

    Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022

  55. [55]

    Google AI for developers. https://ai.google.dev/, 2026

    Google. Google AI for developers. https://ai.google.dev/, 2026. Accessed: 2026-03-05

  56. [56]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  57. [57]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  58. [58]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  59. [59]

    A survey on video diffusion models. ACM Computing Surveys, 57(2):1–42, 2024

    Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Computing Surveys, 57(2):1–42, 2024

  60. [60]

    Toward generalized detection of synthetic media: Limitations, challenges, and the path to multimodal solutions. arXiv preprint arXiv:2511.11116, 2025

    Redwan Hussain, Mizanur Rahman, and Prithwiraj Bhattacharjee. Toward generalized detection of synthetic media: Limitations, challenges, and the path to multimodal solutions. arXiv preprint arXiv:2511.11116, 2025

  61. [61]

    Deepfake technology unveiled: The commoditization of AI and its impact on digital trust

    Claudiu Popa, Rex Pallath, Liam Cunningham, Hewad Tahiri, Abiram Kesavarajah, and Tao Wu. Deepfake technology unveiled: the commoditization of ai and its impact on digital trust. arXiv preprint arXiv:2506.07363, 2025