Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute
Pith reviewed 2026-05-22 17:46 UTC · model grok-4.3
The pith
A zero-shot method generates personalized videos by training once on subject images and random videos at 1% of prior compute costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing SDV-Gen into identity injection learned from subject-image pairs and motion-awareness preservation maintained by a small set of arbitrary videos, and optimizing the two tasks with stochastic switching using random reference-frame sampling and image-token dropout, a single model can be adapted with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours on CogVideoX-5B. This yields about 1% of the compute compared to prior zero-shot baselines while using no subject-video pairs and remaining competitive in subject fidelity and motion quality. The same recipe transfers to Wan 2.2-5B.
What carries the argument
Stochastic switching between identity injection and motion-awareness preservation tasks, with the two objectives shown by gradient analysis to evolve toward nearly orthogonal update subspaces.
If this is right
- The training recipe transfers directly to other pretrained video models such as Wan 2.2-5B.
- Subject fidelity and motion quality remain competitive with prior zero-shot methods that required orders of magnitude more compute and paired data.
- No per-subject tuning is needed at test time.
- Large-scale datasets of subject-video pairs are unnecessary for supervision.
Where Pith is reading between the lines
- Lowering the resource barrier could make personalized video generation practical for smaller labs or consumer tools.
- Similar decomposition into identity and motion objectives may apply to other conditional generation settings such as image or audio synthesis.
- Scaling the approach to even larger base models could produce further efficiency gains or quality improvements.
Load-bearing premise
Identity injection from subject images and motion awareness from arbitrary videos can be jointly optimized via stochastic switching without any subject-video pairs because the objectives evolve to nearly orthogonal update subspaces.
What would settle it
Training the same model without stochastic switching produces either collapsed motion quality or a sharp drop in subject fidelity relative to the reported baselines.
Figures
read the original abstract
Subject-driven video generation (SDV-Gen) aims to produce videos of a specific subject by adapting a pretrained video model, enabling personalized and application-driven content creation. To achieve this goal, per-subject tuning methods require approximately 200 A100 GPU hours to generate a customized video, whereas zero-shot methods avoid per-subject tuning but typically rely on millions of subject-video pairs for the supervision, incurring massive network fine-tuning costs (10K-200K A100 GPU hours). We propose a data- and compute-efficient zero-shot SDV-Gen framework that avoids test-time per-subject tuning and the use of large-scale subject-video pairs. Our key idea decomposes SDV-Gen into (i) identity injection learned from subject-image pairs and (ii) motion-awareness preservation maintained by a small set of arbitrary videos. We optimize the two tasks with stochastic switching, using random reference-frame sampling and image-token dropout to prevent trivial first-frame copying. Our gradient analysis shows that the two objectives rapidly evolve toward nearly orthogonal update subspaces, explaining the stable optimization. Using CogVideoX-5B, we adapt a single model with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours. This yields about 1% of compute compared to prior zero-shot baselines (i.e., 0.4% of VACE and 2.8% of Phantom) while using no subject-video pairs, yet remaining competitive in subject fidelity and motion quality. We show that the same recipe transfers to Wan 2.2-5B.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a zero-shot subject-driven video generation (SDV-Gen) framework that decomposes the task into identity injection learned from 200K subject-image pairs and motion-awareness preservation from 4,000 arbitrary videos. These are jointly optimized on a pretrained model (CogVideoX-5B) via stochastic switching combined with random reference-frame sampling and image-token dropout, without requiring subject-video pairs. Gradient analysis is invoked to show that the objectives evolve toward nearly orthogonal update subspaces, enabling stable training in 288 A100 GPU hours (claimed ~1% of prior zero-shot baselines) while remaining competitive in subject fidelity and motion quality. The recipe is reported to transfer to Wan 2.2-5B.
Significance. If the efficiency and competitiveness claims are substantiated, the work would represent a substantial advance in lowering the data and compute barriers for personalized video generation, potentially making subject-driven methods more accessible. The decomposition strategy and gradient-based justification for avoiding paired data constitute a technically interesting approach to multi-objective fine-tuning in diffusion models.
major comments (3)
- [Gradient analysis] Gradient analysis section: The claim that the identity and motion objectives 'rapidly evolve toward nearly orthogonal update subspaces' is load-bearing for the central argument that stochastic switching suffices without subject-video pairs. The manuscript should provide quantitative support such as plots of gradient cosine similarity over training steps, variance across runs, or an ablation measuring motion coherence degradation when switching is removed or when identity updates dominate.
- [Experimental results] Experimental results section: The abstract and claims assert competitive subject fidelity and motion quality at 1% compute, yet no quantitative tables, metrics (e.g., subject similarity scores, motion quality metrics with error bars), or direct comparisons to VACE and Phantom are referenced in the provided description. This weakens the ability to evaluate the 1% compute advantage and competitiveness.
- [Training procedure] Training procedure section: The stochastic switching mechanism with random reference-frame sampling and image-token dropout is presented as preventing trivial copying, but the manuscript should include an ablation isolating the contribution of each component to final performance and to the observed orthogonality.
minor comments (2)
- [Evaluation metrics] Clarify the exact definition of 'subject fidelity' and 'motion quality' metrics used for competitiveness claims, including any human evaluation protocols.
- [Dataset description] Ensure all dataset sizes (200K subject-image pairs, 4,000 arbitrary videos) are consistently reported with details on curation and diversity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, outlining how we will strengthen the manuscript while preserving its core contributions on data-efficient zero-shot subject-driven video generation.
read point-by-point responses
-
Referee: [Gradient analysis] Gradient analysis section: The claim that the identity and motion objectives 'rapidly evolve toward nearly orthogonal update subspaces' is load-bearing for the central argument that stochastic switching suffices without subject-video pairs. The manuscript should provide quantitative support such as plots of gradient cosine similarity over training steps, variance across runs, or an ablation measuring motion coherence degradation when switching is removed or when identity updates dominate.
Authors: We appreciate the referee's focus on making the gradient analysis more rigorous. The manuscript already presents a gradient analysis showing that the objectives evolve toward nearly orthogonal subspaces, which underpins the stability of stochastic switching. To provide the requested quantitative support, we will add plots of gradient cosine similarity over training steps (with variance across runs) and an ablation measuring motion coherence degradation when switching is removed or identity updates dominate. These additions will be included in the revised gradient analysis section. revision: yes
-
Referee: [Experimental results] Experimental results section: The abstract and claims assert competitive subject fidelity and motion quality at 1% compute, yet no quantitative tables, metrics (e.g., subject similarity scores, motion quality metrics with error bars), or direct comparisons to VACE and Phantom are referenced in the provided description. This weakens the ability to evaluate the 1% compute advantage and competitiveness.
Authors: We acknowledge that the experimental claims would be stronger with more explicit quantitative backing in the main text. The manuscript reports competitiveness in subject fidelity and motion quality at 1% compute relative to VACE and Phantom, but we will expand the experimental results section to include dedicated tables with subject similarity scores, motion quality metrics (including error bars), and direct numerical comparisons to VACE and Phantom. These will be clearly referenced from the abstract and introduction in the revision. revision: yes
-
Referee: [Training procedure] Training procedure section: The stochastic switching mechanism with random reference-frame sampling and image-token dropout is presented as preventing trivial copying, but the manuscript should include an ablation isolating the contribution of each component to final performance and to the observed orthogonality.
Authors: We agree that component-wise ablations would clarify the design choices. The current training procedure uses stochastic switching combined with random reference-frame sampling and image-token dropout to avoid trivial copying while promoting orthogonality. In the revision, we will add an ablation study that isolates the contribution of each element (switching, reference-frame sampling, and token dropout) to both final performance metrics and the observed gradient orthogonality. revision: yes
Circularity Check
No circularity: derivation rests on empirical gradient analysis and standard training decomposition, not definitional reduction
full rationale
The paper's central claim decomposes SDV-Gen into identity injection (from subject-image pairs) and motion preservation (from arbitrary videos) optimized via stochastic switching. The load-bearing justification is the reported gradient analysis showing rapid evolution to nearly orthogonal update subspaces, which is presented as an empirical observation rather than a mathematical identity or fitted parameter renamed as prediction. No equations reduce the orthogonality result to the input data by construction, no self-citation chain is invoked to justify uniqueness, and no ansatz is smuggled via prior work. The 288 A100-hour training outcome and competitive fidelity claims follow from the described procedure without tautological equivalence to the inputs. This is a standard empirical method paper whose performance claims are externally falsifiable via replication on the stated datasets and model.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of subject-image pairs =
200K
- Number of arbitrary videos =
4000
axioms (2)
- domain assumption Subject identity can be effectively learned from image pairs alone while motion awareness is maintained by arbitrary videos.
- domain assumption The identity and motion objectives evolve toward nearly orthogonal update subspaces during joint optimization.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our gradient analysis shows that the two objectives rapidly evolve toward nearly orthogonal update subspaces, explaining the stable optimization... stochastic task-switching... p=0.2
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
decomposes SDV-Gen into (i) identity injection... (ii) motion-awareness preservation... 288 A100 GPU hours... 1% of compute
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Multi-shot character consistency for text-to- video generation.arXiv:2412.07750, 2024
Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, and Gal Chechik. Multi-shot character consistency for text-to- video generation.arXiv:2412.07750, 2024. 2
-
[2]
The chosen one: Consistent characters in text- to-image diffusion models.arXiv:2311.10093, 2023
Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text- to-image diffusion models.arXiv:2311.10093, 2023. 7
-
[3]
Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space- time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024. 2
-
[4]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv:2311.15127, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, and Huisheng Wang
Kelvin C.K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, and Huisheng Wang. Improving subject-driven image syn- thesis with subject-agnostic guidance. InCVPR, 2024. 3
work page 2024
-
[6]
Efficient lifelong learning with a- gem
Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a- gem. InInternational Conference on Learning Representa- tions, 2019. 3
work page 2019
-
[7]
Still-moving: Cus- tomized video generation without customized video data
Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, and Inbar Mosseri. Still-moving: Cus- tomized video generation without customized video data. arXiv:2407.08674, 2024. 2, 3, 7, 8, 22
-
[8]
Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024. 2
work page 2024
-
[9]
Disenbooth: Identity- preserving disentangled tuning for subject-driven text-to- image generation
Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity- preserving disentangled tuning for subject-driven text-to- image generation. InICLR, 2024. 3
work page 2024
-
[10]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. arXiv:2310.00426, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Multi-subject open-set personalization in video generation
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aber- man, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. arXiv:2501.06187, 2025. 2, 3
-
[12]
Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom-data: Towards a general subject-consistent video generation dataset.arXiv preprint arXiv:2506.18851, 2025. 3
-
[13]
Freecustom: Tuning-free cus- tomized image generation for multi-concept composition
Ganggui Ding et al. Freecustom: Tuning-free cus- tomized image generation for multi-concept composition. arXiv:2405.13870, 2024. 3
-
[14]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InICML, 2024. 3
work page 2024
-
[15]
Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999
Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999. 3
work page 1999
-
[16]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H. Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Id- animator: Zero-shot identity-preserving human video gener- ation.arXiv:2404.15275, 2024
Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id- animator: Zero-shot identity-preserving human video gener- ation.arXiv:2404.15275, 2024. 3
-
[19]
Latent video diffusion models for high-fidelity long video generation
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. 2022. 2
work page 2022
-
[20]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv:2205.15868, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
Lora: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022. 4
work page 2022
-
[22]
Hunyuancustom: A multimodal-driven architecture for customized video gener- ation, 2025
Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video gener- ation, 2025. 2, 3
work page 2025
-
[23]
Zhihao Hu and Dong Xu. Videocontrolnet: A motion- guided video-to-video translation framework by using dif- fusion model with controlnet.arXiv:2307.14073, 2023. 2
-
[24]
Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Concept-master: Multi-concept video customiza- tion on diffusion transformer models without test-time tun- ing.arXiv:2501.04698, 2025. 2, 3
-
[25]
VBench: Com- prehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, 9 Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive benchmark suite for video generative models. In CVPR, 2024. 7
work page 2024
-
[26]
Videobooth: Diffusion-based video generation with image prompts
Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, and Ziwei Liu. Videobooth: Diffusion-based video generation with image prompts. InCVPR, 2024. 2, 3, 6
work page 2024
-
[27]
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv:2503.07598, 2025. 2, 3, 6, 21, 24, 25
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Flovd: Optical flow meets video diffu- sion model for enhanced camera-controlled video synthesis
Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. Flovd: Optical flow meets video diffu- sion model for enhanced camera-controlled video synthesis. arXiv:2502.08244, 2025. 19, 20
-
[29]
Pexels-400k.https : / / huggingface
jovianzm. Pexels-400k.https : / / huggingface . co/datasets/jovianzm/Pexels-400k, 2025. Ac- cessed: 2025-03-07. 2, 3, 5, 18, 20
work page 2025
-
[30]
Multi-concept customization of text- to-image diffusion
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shecht- man, and Jun-Yan Zhu. Multi-concept customization of text- to-image diffusion. InCVPR, 2023. 3
work page 2023
-
[31]
Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. InNeurIPS, 2023. 2, 6
work page 2023
-
[32]
Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, and Renjie Liao. Fr´echet video motion distance: A metric for evaluating motion consistency in videos.arXiv:2407.16124,
-
[33]
Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei
Lijie Liu, Tianxaing Ma, Bingchuan Li, Zhuowei Chen, Ji- awei Liu, Qian He, and Xinglong Wu. Phantom: Subject- consistent video generation via cross-modal alignment. arXiv:2502.11079, 2025. 2, 3, 6, 21, 24, 25
-
[34]
Customizable image synthesis with multiple subjects
Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Customizable image synthesis with multiple subjects. InAdvances in neural information processing systems, 2023. 3
work page 2023
-
[35]
Gradient episodic memory for continual learning
David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. InAdvances in neu- ral information processing systems, pages 6467–6476, 2017. 3
work page 2017
-
[36]
Human-level control through deep reinforcement learn- ing.Nature, 518(7540):529–533, 2015
V olodymyr Mnih, Koray Kavukcuoglu, David Silver, An- drei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learn- ing.Nature, 518(7540):529–533, 2015. 3
work page 2015
-
[37]
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models. InCVPR, 2023. 3
work page 2023
-
[38]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. 3
work page 2023
-
[39]
Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123–146, 1995
Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123–146, 1995. 3
work page 1995
-
[40]
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models.arXiv:2112.10752, 2021. 2, 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[41]
Dreambooth: Fine- tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine- tuning text-to-image diffusion models for subject-driven generation. InCVPR, 2023. 2, 3, 7
work page 2023
-
[42]
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay.Advances in neural information processing systems, 30, 2017. 3
work page 2017
-
[43]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv:2104.09864, 2021. 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[44]
OminiControl: Minimal and Universal Control for Diffusion Transformer,
Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer.arXiv:2411.15098, 2024. 2, 3, 4, 5, 6, 7, 8, 17, 21, 22, 23, 24, 25
-
[45]
Raft: Recurrent all-pairs field transforms for optical flow.arXiv:2003.12039, 2020
Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow.arXiv:2003.12039, 2020. 20
-
[46]
Training-free con- sistent text-to-image generation.TOG, 2024
Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free con- sistent text-to-image generation.TOG, 2024. 3
work page 2024
-
[47]
Seedvr: Seeding in- finity in diffusion transformer towards generic video restora- tion
Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seeding in- finity in diffusion transformer towards generic video restora- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 2161– 2172, 2025. 2
work page 2025
-
[48]
Xiaowei Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image per- sonalization with layout guidance.arXiv:2406.07209, 2024. 3
-
[49]
Wan: Open and Advanced Large-Scale Video Generative Models
WanTeam et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Dreamvideo: Composing your dream videos with customized subject and motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhi- heng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hong- ming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. InCVPR, 2024. 2, 3
work page 2024
-
[51]
Dreamvideo-2: Zero-shot subject- driven video customization with precise motion control
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Hao- nan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject- driven video customization with precise motion control. arXiv:2410.13830, 2024. 2, 3
-
[52]
Mo- tionbooth: Motion-aware customized text-to-video genera- tion.arXiv:2406.17758, 2024
Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, and Kai Chen. Mo- tionbooth: Motion-aware customized text-to-video genera- tion.arXiv:2406.17758, 2024. 3
-
[53]
Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guang- cong Zheng, Zhongang Qi, Ying Shan, and Xi Li. Custom- crafter: Customized video generation with preserving motion and concept composition abilities. InAAAI, 2025. 2, 3, 7, 8, 21, 23
work page 2025
-
[54]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffu- sion models with an expert transformer.arXiv:2408.06072,
work page internal anchor Pith review Pith/arXiv arXiv
-
[55]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv:2308.06721, 2023. 2, 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
Gradient surgery for multi-task learning
Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. InAdvances in Neural Information Pro- cessing Systems, 2020. 3, 6, 8
work page 2020
-
[57]
Identity- preserving text-to-video generation by frequency decompo- sition.arXiv:2411.17440, 2024
Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yu- jun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity- preserving text-to-video generation by frequency decompo- sition.arXiv:2411.17440, 2024. 2, 3
-
[58]
Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject- to-video generation.arXiv preprint arXiv:2505.20292, 2025. 3, 20, 21
-
[59]
Patel, Haochen Wang, Xun Huang, Ting- Chun Wang, Ming-Yu Liu, and Yogesh Balaji
Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting- Chun Wang, Ming-Yu Liu, and Yogesh Balaji. Jedi: Joint- image diffusion models for finetuning-free personalized text- to-image generation. InCVPR, 2024. 3
work page 2024
-
[60]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 3
work page 2023
-
[61]
Ssr-encoder: Encoding selective subject representation for subject-driven generation
Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. InCVPR, 2024. 3
work page 2024
-
[62]
Magic mirror: Id-preserved video generation in video diffusion transformers
Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. Magic mirror: Id-preserved video generation in video diffusion transformers. InICCV, 2025. 2, 3 11 Subject-driven Video Generation via Disentangled Identity and Motion Supplementary Material Note: We use green color to refer to figures, tables in the main manuscript (e.g.,...
work page 2025
-
[63]
Register backward hooks on modules containingθ train to copy per-parameter.gradinto a preallocated flat buffer in a fixed parameter order. 0 250 500 750 1000 Step 1.0 0.5 0.0 0.5 1.0 Gradient Alignment 0 250 500 750 1000 Step 0.0 0.5 1.0 1.5 2.0 2.5Grad Norm Image Video Figure 9.Gradient alignment between image and video batches during image-only finetuni...
-
[64]
Capture the flat gradient on each rank before the framework performs all-reduce
-
[65]
Reconstruct the global gradient by reducing the sum of the flat buffers across ranks and dividing by the world size (equivalent to an average pre-sync gradient at the current step)
-
[66]
Move the aggregated flat gradient to the CPU for logging to minimize device memory pressure. When gradient accumula- tion is used, we first accumulate local micro-batches, then cap- ture the pre-sync aggregate. Flattening and Layer-wise Grouping.Let˜g img(t)and ˜gvid(t)be the aggregated flat vectors formed by concatenat- ing per-parameter gradients fromθ ...
-
[67]
Foreground–Background Segmentation.For each video, we use an off-the-shelf segmentation model (e.g., Grounded-SAM2) on thefirst frameto separate foreground and background regions. This allows us to measure object (foreground) motion indepen- dently from any camera-induced background shifts
-
[68]
Optical Flow Computation.We estimate optical flow between thefirst frameand each subsequent frame using a standard flow estimator (e.g., RAFT [45]). Letu f(x)andu b(x)denote the per- pixel flow vectors for the foreground and background pixels, re- spectively, at positionx. We record: FlowMagf = 1 Nf X x∈fg ∥uf(x)∥, FlowMagb = 1 Nb X x∈bg ∥ub(x)∥, whereN f...
-
[69]
Dataset Filtering.To ensure negligible camera motion, we discardany video whose average magnitude of background flow FlowMagb exceeds 10 pixels. This filtering step excludes scenes with significant global shifts, retaining only those with primarily object-centric motion
-
[70]
Category Assignment.Based on the average magnitude of foreground flow FlowMagf (averaged over all frames), we cate- gorize videos into: •Small:0≤FlowMag f ≤25 •Medium:25<FlowMag f ≤50 •Large: FlowMag f >50 Each category contains 300 videos, ensuring a balanced evaluation of low-, moderate-, and high-motion scenarios
-
[71]
Evaluation Protocol.Within each subset, we use only thefirst frame(including any textual or reference cues, if required) to gen- erate a video of the same length. We then compute FVD [32] be- tween the generated outputs and the ground-truth videos. By com- paring FVD acrosssmall,medium, andlargemotion classes, we obtain a clearer picture of how each model...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.