pith. sign in

arxiv: 2504.17816 · v3 · submitted 2025-04-23 · 💻 cs.CV · eess.IV

Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute

Pith reviewed 2026-05-22 17:46 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords subject-driven video generationzero-shot adaptationefficient fine-tuningidentity injectionmotion preservationstochastic switchingvideo diffusion models
0
0 comments X

The pith

A zero-shot method generates personalized videos by training once on subject images and random videos at 1% of prior compute costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes decomposing subject-driven video generation into identity injection learned from subject-image pairs and motion-awareness preservation from arbitrary videos. These tasks are jointly optimized through stochastic switching during training, supported by random reference-frame sampling and image-token dropout to avoid trivial copying. This yields a single adapted model that achieves competitive subject fidelity and motion quality using only 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours on CogVideoX-5B. A sympathetic reader would care because the method eliminates both per-subject test-time tuning and the need for large-scale paired subject-video data that previously drove massive compute requirements.

Core claim

By decomposing SDV-Gen into identity injection learned from subject-image pairs and motion-awareness preservation maintained by a small set of arbitrary videos, and optimizing the two tasks with stochastic switching using random reference-frame sampling and image-token dropout, a single model can be adapted with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours on CogVideoX-5B. This yields about 1% of the compute compared to prior zero-shot baselines while using no subject-video pairs and remaining competitive in subject fidelity and motion quality. The same recipe transfers to Wan 2.2-5B.

What carries the argument

Stochastic switching between identity injection and motion-awareness preservation tasks, with the two objectives shown by gradient analysis to evolve toward nearly orthogonal update subspaces.

If this is right

  • The training recipe transfers directly to other pretrained video models such as Wan 2.2-5B.
  • Subject fidelity and motion quality remain competitive with prior zero-shot methods that required orders of magnitude more compute and paired data.
  • No per-subject tuning is needed at test time.
  • Large-scale datasets of subject-video pairs are unnecessary for supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Lowering the resource barrier could make personalized video generation practical for smaller labs or consumer tools.
  • Similar decomposition into identity and motion objectives may apply to other conditional generation settings such as image or audio synthesis.
  • Scaling the approach to even larger base models could produce further efficiency gains or quality improvements.

Load-bearing premise

Identity injection from subject images and motion awareness from arbitrary videos can be jointly optimized via stochastic switching without any subject-video pairs because the objectives evolve to nearly orthogonal update subspaces.

What would settle it

Training the same model without stochastic switching produces either collapsed motion quality or a sharp drop in subject fidelity relative to the reported baselines.

Figures

Figures reproduced from arXiv: 2504.17816 by Chong Luo, Daneul Kim, Jaesik Park, Jingxu Zhang, Qi Dai, Sunghyun Cho, Wonjoon Jin.

Figure 1
Figure 1. Figure 1: Main results. Our method produces high-quality subject-driven video generation (SDV-Gen) in a zero-shot manner. We extend video generative models to have SDV-Gen capability using the small datasets. For the first two rows, we fine-tune CogVideoX-5B with 4,000 unpaired video set and 200K paired subject-image set. For the result in the last row, we use only 4,000 unpaired videos and 4,000 paired subject-imag… view at source ↗
Figure 2
Figure 2. Figure 2: Dual-task learning strategy. We formulate subject￾driven video generation as a dual-task problem. First is iden￾tity injection (Bottom) from paired subject-images, and second is motion-awareness preservation (Left), which we utilize unpaired videos and conduct stochastically switched learning. 2. Related Work 2.1. Subject-driven Image Generation Diffusion models have greatly advanced text-to-image gen￾erat… view at source ↗
Figure 3
Figure 3. Figure 3: Training and Inference Details. Left: During training, we stochastically alternate between two objectives: identity injection us￾ing paired subject-images and motion-awareness preservation using a small set of unpaired videos. Right: At inference time, no additional per-subject tuning is required. The model generates a video conditioned on the reference image and text prompt in a zero-shot manner. videos. … view at source ↗
Figure 4
Figure 4. Figure 4: Limitation of SDI-Gen→I2V method. With the subject presented small in the first frame, I2V fails to generate consistent results as it cannot interpret low-resolution subjects. 0 250 500 750 1000 Step 1.0 0.5 0.0 0.5 1.0 Gradient Alignment 0 250 500 750 1000 Step 0.0 0.2 0.4 0.6 0.8 Grad Norm Image Video [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Gradient analysis on alignment and norms dur￾ing fine-tuning. Left: Cosine similarity ϕ(t) between gimg and gvid (over trainable parameters) quickly collapses to a narrow band near zero under dual-task training, indicating emergent near￾orthogonality. Right: ℓ2 norms ∥gimg(t)∥2 and ∥gvid(t)∥2 remain non-negligible and similar in scale after 100-step. 4.1. Do the Gradients Become Orthogonal? [PITH_FULL_IMA… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison with zero-shot methods (left) and per-subject tuning methods (right). Ours mini denotes our model fine-tuned with 4,000 subset of Subject-200K [44]. Note that ours is zero-shot tuning-free, requiring no tuning at inference time. we add qualitative comparison with Still-Moving [7] and CustomCrafter [53], using the results reported in their pa￾per and supplements due to unavailable cod… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative Comparison with PCGrad. our method demonstrates better detail, ID consistency, and motion-awareness than tuning-free baselines Vidu 2.0 and VideoBooth, and is competitive with Phantom and VACE with Wan. For the backpack example, our model faithfully reproduces fine textures and foreground motion, whereas Vidu 2.0 preserves ID but exhibits jittery trajectories, and VideoBooth yields the weakest … view at source ↗
Figure 9
Figure 9. Figure 9: Gradient alignment between image and video batches during image-only finetuning and gradient norms of two ob￾jectives. Recorded gradient alignment and norm every 50 steps. 2. Capture the flat gradient on each rank before the framework performs all-reduce. 3. Reconstruct the global gradient by reducing the sum of the flat buffers across ranks and dividing by the world size (equivalent to an average pre-sync… view at source ↗
Figure 10
Figure 10. Figure 10: Gradient alignment comparison of the two objec￾tives when applying PCGrad or not. Red and Orange line indi￾cate the alignment result with PCGrad with different buffer sizes, and blue dashed line indicates the alignment without PCGrad, equivalent to [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ablation on varying S&I set size used for training. Up to a subset count of 4K, they exhibit successful identity, but when reduced to 2K (≈ 1% of Subject200K), they show failure from time to time. Ours w/o random image drop w/o random initial frame + “A close up view. A bowl of oranges placed on a wooden table. The background is a dark room, the TV is on, and the screen is showing a cooking show.” +"On th… view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative result on ablation study of our compo￾nent in temporal awareness preservation [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Additional qualitative comparison with Still-Moving [7]. Note that mini denotes our method trained with a 4K subset of Subject 200K [44]. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Additional qualitative comparison with CustomCrafter [53]. Note that mini denotes our method trained with 4K subset of Subject 200K [44]. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Additional qualitative comparison with tuning-free baselines including VACE Wan-1.3B [27], Phantom Wan-1.3B [33]. Note that mini denotes our method trained with 4K subset of Subject 200K [44] 24 [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Additional qualitative comparison with tuning-free baselines including VACE Wan-1.3B [27], Phantom Wan-1.3B [33]. Note that mini denotes our method trained with 4K subset of Subject 200K [44] 25 [PITH_FULL_IMAGE:figures/full_fig_p025_17.png] view at source ↗
read the original abstract

Subject-driven video generation (SDV-Gen) aims to produce videos of a specific subject by adapting a pretrained video model, enabling personalized and application-driven content creation. To achieve this goal, per-subject tuning methods require approximately 200 A100 GPU hours to generate a customized video, whereas zero-shot methods avoid per-subject tuning but typically rely on millions of subject-video pairs for the supervision, incurring massive network fine-tuning costs (10K-200K A100 GPU hours). We propose a data- and compute-efficient zero-shot SDV-Gen framework that avoids test-time per-subject tuning and the use of large-scale subject-video pairs. Our key idea decomposes SDV-Gen into (i) identity injection learned from subject-image pairs and (ii) motion-awareness preservation maintained by a small set of arbitrary videos. We optimize the two tasks with stochastic switching, using random reference-frame sampling and image-token dropout to prevent trivial first-frame copying. Our gradient analysis shows that the two objectives rapidly evolve toward nearly orthogonal update subspaces, explaining the stable optimization. Using CogVideoX-5B, we adapt a single model with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours. This yields about 1% of compute compared to prior zero-shot baselines (i.e., 0.4% of VACE and 2.8% of Phantom) while using no subject-video pairs, yet remaining competitive in subject fidelity and motion quality. We show that the same recipe transfers to Wan 2.2-5B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a zero-shot subject-driven video generation (SDV-Gen) framework that decomposes the task into identity injection learned from 200K subject-image pairs and motion-awareness preservation from 4,000 arbitrary videos. These are jointly optimized on a pretrained model (CogVideoX-5B) via stochastic switching combined with random reference-frame sampling and image-token dropout, without requiring subject-video pairs. Gradient analysis is invoked to show that the objectives evolve toward nearly orthogonal update subspaces, enabling stable training in 288 A100 GPU hours (claimed ~1% of prior zero-shot baselines) while remaining competitive in subject fidelity and motion quality. The recipe is reported to transfer to Wan 2.2-5B.

Significance. If the efficiency and competitiveness claims are substantiated, the work would represent a substantial advance in lowering the data and compute barriers for personalized video generation, potentially making subject-driven methods more accessible. The decomposition strategy and gradient-based justification for avoiding paired data constitute a technically interesting approach to multi-objective fine-tuning in diffusion models.

major comments (3)
  1. [Gradient analysis] Gradient analysis section: The claim that the identity and motion objectives 'rapidly evolve toward nearly orthogonal update subspaces' is load-bearing for the central argument that stochastic switching suffices without subject-video pairs. The manuscript should provide quantitative support such as plots of gradient cosine similarity over training steps, variance across runs, or an ablation measuring motion coherence degradation when switching is removed or when identity updates dominate.
  2. [Experimental results] Experimental results section: The abstract and claims assert competitive subject fidelity and motion quality at 1% compute, yet no quantitative tables, metrics (e.g., subject similarity scores, motion quality metrics with error bars), or direct comparisons to VACE and Phantom are referenced in the provided description. This weakens the ability to evaluate the 1% compute advantage and competitiveness.
  3. [Training procedure] Training procedure section: The stochastic switching mechanism with random reference-frame sampling and image-token dropout is presented as preventing trivial copying, but the manuscript should include an ablation isolating the contribution of each component to final performance and to the observed orthogonality.
minor comments (2)
  1. [Evaluation metrics] Clarify the exact definition of 'subject fidelity' and 'motion quality' metrics used for competitiveness claims, including any human evaluation protocols.
  2. [Dataset description] Ensure all dataset sizes (200K subject-image pairs, 4,000 arbitrary videos) are consistently reported with details on curation and diversity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, outlining how we will strengthen the manuscript while preserving its core contributions on data-efficient zero-shot subject-driven video generation.

read point-by-point responses
  1. Referee: [Gradient analysis] Gradient analysis section: The claim that the identity and motion objectives 'rapidly evolve toward nearly orthogonal update subspaces' is load-bearing for the central argument that stochastic switching suffices without subject-video pairs. The manuscript should provide quantitative support such as plots of gradient cosine similarity over training steps, variance across runs, or an ablation measuring motion coherence degradation when switching is removed or when identity updates dominate.

    Authors: We appreciate the referee's focus on making the gradient analysis more rigorous. The manuscript already presents a gradient analysis showing that the objectives evolve toward nearly orthogonal subspaces, which underpins the stability of stochastic switching. To provide the requested quantitative support, we will add plots of gradient cosine similarity over training steps (with variance across runs) and an ablation measuring motion coherence degradation when switching is removed or identity updates dominate. These additions will be included in the revised gradient analysis section. revision: yes

  2. Referee: [Experimental results] Experimental results section: The abstract and claims assert competitive subject fidelity and motion quality at 1% compute, yet no quantitative tables, metrics (e.g., subject similarity scores, motion quality metrics with error bars), or direct comparisons to VACE and Phantom are referenced in the provided description. This weakens the ability to evaluate the 1% compute advantage and competitiveness.

    Authors: We acknowledge that the experimental claims would be stronger with more explicit quantitative backing in the main text. The manuscript reports competitiveness in subject fidelity and motion quality at 1% compute relative to VACE and Phantom, but we will expand the experimental results section to include dedicated tables with subject similarity scores, motion quality metrics (including error bars), and direct numerical comparisons to VACE and Phantom. These will be clearly referenced from the abstract and introduction in the revision. revision: yes

  3. Referee: [Training procedure] Training procedure section: The stochastic switching mechanism with random reference-frame sampling and image-token dropout is presented as preventing trivial copying, but the manuscript should include an ablation isolating the contribution of each component to final performance and to the observed orthogonality.

    Authors: We agree that component-wise ablations would clarify the design choices. The current training procedure uses stochastic switching combined with random reference-frame sampling and image-token dropout to avoid trivial copying while promoting orthogonality. In the revision, we will add an ablation study that isolates the contribution of each element (switching, reference-frame sampling, and token dropout) to both final performance metrics and the observed gradient orthogonality. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation rests on empirical gradient analysis and standard training decomposition, not definitional reduction

full rationale

The paper's central claim decomposes SDV-Gen into identity injection (from subject-image pairs) and motion preservation (from arbitrary videos) optimized via stochastic switching. The load-bearing justification is the reported gradient analysis showing rapid evolution to nearly orthogonal update subspaces, which is presented as an empirical observation rather than a mathematical identity or fitted parameter renamed as prediction. No equations reduce the orthogonality result to the input data by construction, no self-citation chain is invoked to justify uniqueness, and no ansatz is smuggled via prior work. The 288 A100-hour training outcome and competitive fidelity claims follow from the described procedure without tautological equivalence to the inputs. This is a standard empirical method paper whose performance claims are externally falsifiable via replication on the stated datasets and model.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the separability of identity and motion learning plus the empirical observation of orthogonal gradients; the chosen dataset sizes function as free parameters selected for the reported outcome.

free parameters (2)
  • Number of subject-image pairs = 200K
    200K pairs selected to achieve reported subject fidelity under the compute budget.
  • Number of arbitrary videos = 4000
    4,000 videos chosen as a small set sufficient for motion awareness.
axioms (2)
  • domain assumption Subject identity can be effectively learned from image pairs alone while motion awareness is maintained by arbitrary videos.
    Core premise of the two-task decomposition stated in the abstract.
  • domain assumption The identity and motion objectives evolve toward nearly orthogonal update subspaces during joint optimization.
    Invoked to explain stable training; supported by the paper's gradient analysis.

pith-pipeline@v0.9.0 · 5833 in / 1526 out tokens · 44219 ms · 2026-05-22T17:46:38.136527+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 11 internal anchors

  1. [1]

    Multi-shot character consistency for text-to- video generation.arXiv:2412.07750, 2024

    Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, and Gal Chechik. Multi-shot character consistency for text-to- video generation.arXiv:2412.07750, 2024. 2

  2. [2]

    The chosen one: Consistent characters in text- to-image diffusion models.arXiv:2311.10093, 2023

    Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text- to-image diffusion models.arXiv:2311.10093, 2023. 7

  3. [3]

    Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space- time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024. 2

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv:2311.15127, 2023. 1

  5. [5]

    Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, and Huisheng Wang

    Kelvin C.K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, and Huisheng Wang. Improving subject-driven image syn- thesis with subject-agnostic guidance. InCVPR, 2024. 3

  6. [6]

    Efficient lifelong learning with a- gem

    Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a- gem. InInternational Conference on Learning Representa- tions, 2019. 3

  7. [7]

    Still-moving: Cus- tomized video generation without customized video data

    Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, and Inbar Mosseri. Still-moving: Cus- tomized video generation without customized video data. arXiv:2407.08674, 2024. 2, 3, 7, 8, 22

  8. [8]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024. 2

  9. [9]

    Disenbooth: Identity- preserving disentangled tuning for subject-driven text-to- image generation

    Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity- preserving disentangled tuning for subject-driven text-to- image generation. InICLR, 2024. 3

  10. [10]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. arXiv:2310.00426, 2023. 3

  11. [11]

    Multi-subject open-set personalization in video generation

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aber- man, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. arXiv:2501.06187, 2025. 2, 3

  12. [12]

    Phantom-data: Towards a general subject-consistent video generation dataset.arXiv preprint arXiv:2506.18851, 2025

    Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom-data: Towards a general subject-consistent video generation dataset.arXiv preprint arXiv:2506.18851, 2025. 3

  13. [13]

    Freecustom: Tuning-free cus- tomized image generation for multi-concept composition

    Ganggui Ding et al. Freecustom: Tuning-free cus- tomized image generation for multi-concept composition. arXiv:2405.13870, 2024. 3

  14. [14]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InICML, 2024. 3

  15. [15]

    Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

    Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999. 3

  16. [16]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H. Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022. 3

  17. [17]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 2

  18. [18]

    Id- animator: Zero-shot identity-preserving human video gener- ation.arXiv:2404.15275, 2024

    Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id- animator: Zero-shot identity-preserving human video gener- ation.arXiv:2404.15275, 2024. 3

  19. [19]

    Latent video diffusion models for high-fidelity long video generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. 2022. 2

  20. [20]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv:2205.15868, 2022. 1

  21. [21]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022. 4

  22. [22]

    Hunyuancustom: A multimodal-driven architecture for customized video gener- ation, 2025

    Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video gener- ation, 2025. 2, 3

  23. [23]

    Videocontrolnet: A motion- guided video-to-video translation framework by using dif- fusion model with controlnet.arXiv:2307.14073, 2023

    Zhihao Hu and Dong Xu. Videocontrolnet: A motion- guided video-to-video translation framework by using dif- fusion model with controlnet.arXiv:2307.14073, 2023. 2

  24. [24]

    Concept-master: Multi-concept video customiza- tion on diffusion transformer models without test-time tun- ing.arXiv:2501.04698, 2025

    Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Concept-master: Multi-concept video customiza- tion on diffusion transformer models without test-time tun- ing.arXiv:2501.04698, 2025. 2, 3

  25. [25]

    VBench: Com- prehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, 9 Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive benchmark suite for video generative models. In CVPR, 2024. 7

  26. [26]

    Videobooth: Diffusion-based video generation with image prompts

    Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, and Ziwei Liu. Videobooth: Diffusion-based video generation with image prompts. InCVPR, 2024. 2, 3, 6

  27. [27]

    VACE: All-in-One Video Creation and Editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv:2503.07598, 2025. 2, 3, 6, 21, 24, 25

  28. [28]

    Flovd: Optical flow meets video diffu- sion model for enhanced camera-controlled video synthesis

    Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. Flovd: Optical flow meets video diffu- sion model for enhanced camera-controlled video synthesis. arXiv:2502.08244, 2025. 19, 20

  29. [29]

    Pexels-400k.https : / / huggingface

    jovianzm. Pexels-400k.https : / / huggingface . co/datasets/jovianzm/Pexels-400k, 2025. Ac- cessed: 2025-03-07. 2, 3, 5, 18, 20

  30. [30]

    Multi-concept customization of text- to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shecht- man, and Jun-Yan Zhu. Multi-concept customization of text- to-image diffusion. InCVPR, 2023. 3

  31. [31]

    Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

    Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. InNeurIPS, 2023. 2, 6

  32. [32]

    Fr´echet video motion distance: A metric for evaluating motion consistency in videos.arXiv:2407.16124,

    Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, and Renjie Liao. Fr´echet video motion distance: A metric for evaluating motion consistency in videos.arXiv:2407.16124,

  33. [33]

    Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei

    Lijie Liu, Tianxaing Ma, Bingchuan Li, Zhuowei Chen, Ji- awei Liu, Qian He, and Xinglong Wu. Phantom: Subject- consistent video generation via cross-modal alignment. arXiv:2502.11079, 2025. 2, 3, 6, 21, 24, 25

  34. [34]

    Customizable image synthesis with multiple subjects

    Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Customizable image synthesis with multiple subjects. InAdvances in neural information processing systems, 2023. 3

  35. [35]

    Gradient episodic memory for continual learning

    David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. InAdvances in neu- ral information processing systems, pages 6467–6476, 2017. 3

  36. [36]

    Human-level control through deep reinforcement learn- ing.Nature, 518(7540):529–533, 2015

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, An- drei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learn- ing.Nature, 518(7540):529–533, 2015. 3

  37. [37]

    T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models. InCVPR, 2023. 3

  38. [38]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. 3

  39. [39]

    Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123–146, 1995

    Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123–146, 1995. 3

  40. [40]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models.arXiv:2112.10752, 2021. 2, 3, 6

  41. [41]

    Dreambooth: Fine- tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine- tuning text-to-image diffusion models for subject-driven generation. InCVPR, 2023. 2, 3, 7

  42. [42]

    Continual learning with deep generative replay.Advances in neural information processing systems, 30, 2017

    Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay.Advances in neural information processing systems, 30, 2017. 3

  43. [43]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv:2104.09864, 2021. 3

  44. [44]

    OminiControl: Minimal and Universal Control for Diffusion Transformer,

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer.arXiv:2411.15098, 2024. 2, 3, 4, 5, 6, 7, 8, 17, 21, 22, 23, 24, 25

  45. [45]

    Raft: Recurrent all-pairs field transforms for optical flow.arXiv:2003.12039, 2020

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow.arXiv:2003.12039, 2020. 20

  46. [46]

    Training-free con- sistent text-to-image generation.TOG, 2024

    Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free con- sistent text-to-image generation.TOG, 2024. 3

  47. [47]

    Seedvr: Seeding in- finity in diffusion transformer towards generic video restora- tion

    Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seeding in- finity in diffusion transformer towards generic video restora- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 2161– 2172, 2025. 2

  48. [48]

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al

    Xiaowei Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image per- sonalization with layout guidance.arXiv:2406.07209, 2024. 3

  49. [49]

    Wan: Open and Advanced Large-Scale Video Generative Models

    WanTeam et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 1, 2, 6

  50. [50]

    Dreamvideo: Composing your dream videos with customized subject and motion

    Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhi- heng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hong- ming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. InCVPR, 2024. 2, 3

  51. [51]

    Dreamvideo-2: Zero-shot subject- driven video customization with precise motion control

    Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Hao- nan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject- driven video customization with precise motion control. arXiv:2410.13830, 2024. 2, 3

  52. [52]

    Mo- tionbooth: Motion-aware customized text-to-video genera- tion.arXiv:2406.17758, 2024

    Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, and Kai Chen. Mo- tionbooth: Motion-aware customized text-to-video genera- tion.arXiv:2406.17758, 2024. 3

  53. [53]

    Custom- crafter: Customized video generation with preserving motion and concept composition abilities

    Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guang- cong Zheng, Zhongang Qi, Ying Shan, and Xi Li. Custom- crafter: Customized video generation with preserving motion and concept composition abilities. InAAAI, 2025. 2, 3, 7, 8, 21, 23

  54. [54]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffu- sion models with an expert transformer.arXiv:2408.06072,

  55. [55]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv:2308.06721, 2023. 2, 3, 6

  56. [56]

    Gradient surgery for multi-task learning

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. InAdvances in Neural Information Pro- cessing Systems, 2020. 3, 6, 8

  57. [57]

    Identity- preserving text-to-video generation by frequency decompo- sition.arXiv:2411.17440, 2024

    Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yu- jun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity- preserving text-to-video generation by frequency decompo- sition.arXiv:2411.17440, 2024. 2, 3

  58. [58]

    Opens2v-nexus: A detailed benchmark and million-scale dataset for subject- to-video generation.arXiv preprint arXiv:2505.20292, 2025

    Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject- to-video generation.arXiv preprint arXiv:2505.20292, 2025. 3, 20, 21

  59. [59]

    Patel, Haochen Wang, Xun Huang, Ting- Chun Wang, Ming-Yu Liu, and Yogesh Balaji

    Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting- Chun Wang, Ming-Yu Liu, and Yogesh Balaji. Jedi: Joint- image diffusion models for finetuning-free personalized text- to-image generation. InCVPR, 2024. 3

  60. [60]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 3

  61. [61]

    Ssr-encoder: Encoding selective subject representation for subject-driven generation

    Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. InCVPR, 2024. 3

  62. [62]

    Magic mirror: Id-preserved video generation in video diffusion transformers

    Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. Magic mirror: Id-preserved video generation in video diffusion transformers. InICCV, 2025. 2, 3 11 Subject-driven Video Generation via Disentangled Identity and Motion Supplementary Material Note: We use green color to refer to figures, tables in the main manuscript (e.g.,...

  63. [63]

    Register backward hooks on modules containingθ train to copy per-parameter.gradinto a preallocated flat buffer in a fixed parameter order. 0 250 500 750 1000 Step 1.0 0.5 0.0 0.5 1.0 Gradient Alignment 0 250 500 750 1000 Step 0.0 0.5 1.0 1.5 2.0 2.5Grad Norm Image Video Figure 9.Gradient alignment between image and video batches during image-only finetuni...

  64. [64]

    Capture the flat gradient on each rank before the framework performs all-reduce

  65. [65]

    Reconstruct the global gradient by reducing the sum of the flat buffers across ranks and dividing by the world size (equivalent to an average pre-sync gradient at the current step)

  66. [66]

    Image-only

    Move the aggregated flat gradient to the CPU for logging to minimize device memory pressure. When gradient accumula- tion is used, we first accumulate local micro-batches, then cap- ture the pre-sync aggregate. Flattening and Layer-wise Grouping.Let˜g img(t)and ˜gvid(t)be the aggregated flat vectors formed by concatenat- ing per-parameter gradients fromθ ...

  67. [67]

    This allows us to measure object (foreground) motion indepen- dently from any camera-induced background shifts

    Foreground–Background Segmentation.For each video, we use an off-the-shelf segmentation model (e.g., Grounded-SAM2) on thefirst frameto separate foreground and background regions. This allows us to measure object (foreground) motion indepen- dently from any camera-induced background shifts

  68. [68]

    Letu f(x)andu b(x)denote the per- pixel flow vectors for the foreground and background pixels, re- spectively, at positionx

    Optical Flow Computation.We estimate optical flow between thefirst frameand each subsequent frame using a standard flow estimator (e.g., RAFT [45]). Letu f(x)andu b(x)denote the per- pixel flow vectors for the foreground and background pixels, re- spectively, at positionx. We record: FlowMagf = 1 Nf X x∈fg ∥uf(x)∥, FlowMagb = 1 Nb X x∈bg ∥ub(x)∥, whereN f...

  69. [69]

    This filtering step excludes scenes with significant global shifts, retaining only those with primarily object-centric motion

    Dataset Filtering.To ensure negligible camera motion, we discardany video whose average magnitude of background flow FlowMagb exceeds 10 pixels. This filtering step excludes scenes with significant global shifts, retaining only those with primarily object-centric motion

  70. [70]

    Category Assignment.Based on the average magnitude of foreground flow FlowMagf (averaged over all frames), we cate- gorize videos into: •Small:0≤FlowMag f ≤25 •Medium:25<FlowMag f ≤50 •Large: FlowMag f >50 Each category contains 300 videos, ensuring a balanced evaluation of low-, moderate-, and high-motion scenarios

  71. [71]

    pigskiing down a slope

    Evaluation Protocol.Within each subset, we use only thefirst frame(including any textual or reference cues, if required) to gen- erate a video of the same length. We then compute FVD [32] be- tween the generated outputs and the ground-truth videos. By com- paring FVD acrosssmall,medium, andlargemotion classes, we obtain a clearer picture of how each model...