Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
Pith reviewed 2026-05-10 15:05 UTC · model grok-4.3
The pith
Immune2V protects images from video deepfakes by balancing adversarial noise across frames at the encoder and steering the generation process toward a precomputed collapse trajectory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern I2V models resist naive image-level adversarial attacks for two reasons: the video encoding process rapidly dilutes adversarial noise across future frames, and continuous text-conditioned guidance overrides the disruptive intent. Immune2V addresses both failure modes. It enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and it aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. The result is substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.
What carries the argument
Temporally balanced latent divergence at the encoder level together with alignment to a precomputed collapse-inducing trajectory, which maintains adversarial signal across time steps and steers the generation process away from coherent output.
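The two mechanisms can be sketched as a single perturbation objective. The toy below is a hypothetical simplification, not the paper's implementation: a linear "video encoder" maps an image to per-frame latents, and sign-gradient ascent crafts a budget-bounded perturbation that (a) maximizes latent divergence on every frame while penalizing its variance across frames (temporal balance, against dilution) and (b) pulls perturbed latents toward a made-up collapse target (trajectory alignment). All names and weights here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D, P = 8, 16, 32                                # frames, latent dim, pixel dim
W = rng.standard_normal((F, D, P)) / np.sqrt(P)    # per-frame encoder maps z_f = W_f x
x = rng.standard_normal(P)                         # clean image
t = rng.standard_normal((F, D))                    # stand-in collapse trajectory
eps, step, lam, mu = 0.05, 0.01, 0.5, 0.1          # L_inf budget, step, balance, align

delta = np.zeros(P)
for _ in range(100):
    z = W @ (x + delta)                            # perturbed latents, shape (F, D)
    d = ((z - W @ x) ** 2).sum(axis=1)             # per-frame latent divergence
    # gradient of mean divergence: average of 2 W_f^T W_f delta
    g_mean = np.mean([2 * W[f].T @ (W[f] @ delta) for f in range(F)], axis=0)
    # gradient of var(d), penalized so no frame's divergence dominates
    g_var = sum((2 * (d[f] - d.mean()) / F) * (2 * W[f].T @ (W[f] @ delta))
                for f in range(F))
    # gradient of alignment term ||z_f - t_f||^2 (minimized)
    g_align = np.mean([2 * W[f].T @ (z[f] - t[f]) for f in range(F)], axis=0)
    grad = g_mean - lam * g_var - mu * g_align     # ascend the combined objective
    delta = np.clip(delta + step * np.sign(grad), -eps, eps)

div = ((W @ (x + delta) - W @ x) ** 2).sum(axis=1)
print("per-frame divergence:", np.round(div, 3))
```

The balance penalty is what distinguishes this from a naive attack, which would concentrate divergence on early frames and let it decay, exactly the dilution failure the paper diagnoses.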
If this is right
- Videos generated from immunized images exhibit stronger and longer-lasting visual degradation than those from images protected by adapted static methods.
- The protection remains effective while the changes to the original image stay imperceptible to viewers.
- Defenses against video synthesis must operate inside the model's temporal and conditional mechanisms rather than only at the input image level.
- The encoder balancing and trajectory alignment can be used as a starting point for protecting against other forms of conditional video generation.
Where Pith is reading between the lines
- The same balancing principle could be tested on multi-frame tasks such as animation or 3D lifting where signal dilution across outputs is also likely.
- Understanding the internal encoder dynamics of a generator appears necessary for robust immunization, pointing toward architecture-specific rather than generic input perturbations.
- Real-world deployment would require checking performance on user-chosen text prompts and commercial I2V services not studied in the paper.
Load-bearing premise
That noise dilution in video encoding and override by text guidance are the dominant reasons image attacks fail, and that encoder-level balancing plus trajectory alignment will work across different I2V architectures and prompts.
What would settle it
Applying Immune2V to an I2V model not used in the original experiments and measuring whether the resulting videos still exhibit substantially stronger and more persistent degradation than image-level baselines under identical imperceptibility constraints.
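Such a settling experiment reduces to two numbers per method: overall strength and persistence into late frames, where dilution would show. A minimal sketch with synthetic stand-in scores (in practice these would be per-frame perceptual distances, e.g. LPIPS or DreamSim, between protected-input and clean-input generations):

```python
import numpy as np

def summarize(scores: np.ndarray) -> dict:
    """Mean degradation overall, and over the last third of frames."""
    tail = scores[2 * len(scores) // 3:]
    return {"strength": float(scores.mean()),
            "persistence": float(tail.mean())}

frames = np.arange(16)
baseline = 0.8 * np.exp(-0.3 * frames)     # image-level noise dilutes over time
immunized = 0.7 + 0.0 * frames             # balanced divergence holds (synthetic)

b, m = summarize(baseline), summarize(immunized)
print("baseline:", b)
print("immunized:", m)
```

The claim would survive only if the immunized persistence stays well above the baseline's on a held-out I2V model, under the same imperceptibility constraint on the input.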
Original abstract
Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Immune2V, a framework to immunize static images against dual-stream image-to-video (I2V) generation. It analyzes two failure modes of naive adversarial attacks—rapid dilution of noise during video encoding and override by continuous text-conditioned guidance—and proposes temporally balanced latent divergence at the encoder level plus alignment of intermediate representations to a precomputed collapse-inducing trajectory. The central claim is that this produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.
Significance. If the experimental claims hold with detailed quantitative support and cross-model validation, the work would be significant for extending adversarial immunization from static images to video generation, addressing deepfake risks. The systematic breakdown of I2V robustness mechanisms is a conceptual strength that could inform future defenses in generative models.
Major comments (2)
- Abstract: the central claim that Immune2V 'produces substantially stronger and more persistent degradation than adapted image-level baselines' is presented without any quantitative metrics, error bars, specific improvement values (e.g., degradation scores or success rates), model details, or ablation results. This absence makes the empirical superiority impossible to assess and is load-bearing for the paper's main contribution.
- The method section (and associated experiments): the two proposed mechanisms—temporally balanced latent divergence and alignment to a precomputed collapse-inducing trajectory—are asserted to counteract dilution and text-guidance override, but no evidence is provided that the precomputed trajectory or balancing strategy transfers beyond the specific dual-stream I2V architectures and prompt distributions used for design. This directly affects the generalization required for the claim to hold outside the evaluated setting.
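The reporting the referee asks for is cheap to provide: aggregate the degradation metric over seeds and report mean and standard deviation per method, so "substantially stronger" becomes checkable. The numbers below are placeholders, not the paper's results:

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder per-seed degradation scores for two methods (5 seeds each).
runs = {"image-level baseline": 0.30 + 0.02 * rng.standard_normal(5),
        "Immune2V":             0.65 + 0.02 * rng.standard_normal(5)}

report = {name: (float(v.mean()), float(v.std(ddof=1)))
          for name, v in runs.items()}
for name, (mean, std) in report.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```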
Minor comments (1)
- Abstract: the term 'collapse-inducing trajectory' is introduced without a concise definition or reference to its computation; adding a brief parenthetical explanation would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing honest responses based on the manuscript content. Revisions have been made where the comments identify clear gaps in presentation or support.
Point-by-point responses
Referee: Abstract: the central claim that Immune2V 'produces substantially stronger and more persistent degradation than adapted image-level baselines' is presented without any quantitative metrics, error bars, specific improvement values (e.g., degradation scores or success rates), model details, or ablation results. This absence makes the empirical superiority impossible to assess and is load-bearing for the paper's main contribution.
Authors: We agree that the abstract would be strengthened by including key quantitative results to support the central claim. The body of the manuscript contains these metrics (degradation scores, success rates, error bars, model specifics, and ablation outcomes), but they were not summarized in the abstract. We have revised the abstract to incorporate representative quantitative values and model details drawn directly from the experimental results, while keeping the abstract concise. revision: yes
Referee: The method section (and associated experiments): the two proposed mechanisms—temporally balanced latent divergence and alignment to a precomputed collapse-inducing trajectory—are asserted to counteract dilution and text-guidance override, but no evidence is provided that the precomputed trajectory or balancing strategy transfers beyond the specific dual-stream I2V architectures and prompt distributions used for design. This directly affects the generalization required for the claim to hold outside the evaluated setting.
Authors: We acknowledge that the primary evaluations focus on the dual-stream I2V architectures and prompt sets used during development. The manuscript does include ablations isolating the contribution of each mechanism to counteracting dilution and override within those settings. We have revised the method and experimental sections to more explicitly delineate the evaluated scope and added a limitations discussion on generalization. Broader cross-architecture validation beyond the tested models would require additional experiments not present in the current work. revision: partial
Circularity Check
No circularity: empirical construction without self-referential derivations or fitted predictions
full rationale
The paper's central contribution is an empirical framework (temporally balanced latent divergence plus trajectory alignment) motivated by observed failure modes in I2V models. No equations, fitted parameters, or 'predictions' are presented that reduce by construction to the inputs or to self-citations. The method is described as an engineering response to dilution and override effects rather than a first-principles derivation. Self-citations, if present, are not load-bearing for the core claims, and the work remains self-contained against external benchmarks. This matches the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the video encoder's latent space allows additive adversarial perturbations to be propagated without immediate collapse.
invented entities (1)
- Collapse-inducing trajectory (no independent evidence)