Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute

Chong Luo; Daneul Kim; Jaesik Park; Jingxu Zhang; Qi Dai; Sunghyun Cho; Wonjoon Jin

arxiv: 2504.17816 · v3 · submitted 2025-04-23 · 💻 cs.CV · eess.IV

Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo

Pith reviewed 2026-05-22 17:46 UTC · model grok-4.3

keywords: subject-driven video generation · zero-shot adaptation · efficient fine-tuning · identity injection · motion preservation · stochastic switching · video diffusion models

The pith

A zero-shot method generates personalized videos by training once on subject images and random videos at 1% of prior compute costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes decomposing subject-driven video generation into identity injection learned from subject-image pairs and motion-awareness preservation from arbitrary videos. These tasks are jointly optimized through stochastic switching during training, supported by random reference-frame sampling and image-token dropout to avoid trivial copying. This yields a single adapted model that achieves competitive subject fidelity and motion quality using only 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours on CogVideoX-5B. A sympathetic reader would care because the method eliminates both per-subject test-time tuning and the need for large-scale paired subject-video data that previously drove massive compute requirements.

Core claim

By decomposing SDV-Gen into identity injection learned from subject-image pairs and motion-awareness preservation maintained by a small set of arbitrary videos, and optimizing the two tasks with stochastic switching using random reference-frame sampling and image-token dropout, a single model can be adapted with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours on CogVideoX-5B. This is about 1% of the compute of prior zero-shot baselines, uses no subject-video pairs, and remains competitive in subject fidelity and motion quality. The same recipe transfers to Wan 2.2-5B.

What carries the argument

Stochastic switching between identity injection and motion-awareness preservation tasks, with the two objectives shown by gradient analysis to evolve toward nearly orthogonal update subspaces.
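
As a rough illustration of stochastic switching, the sketch below draws one task per training step from a fixed probability. The names `choose_task`, `simulate_schedule`, and `p_video`, and the default value 0.2, are hypothetical and chosen only for illustration; the paper's actual schedule and code may differ.

```python
import random
from collections import Counter


def choose_task(rng: random.Random, p_video: float) -> str:
    """Pick the objective for one training step.

    p_video is an ASSUMED probability of a motion-preservation (video) step;
    every other step is an identity-injection (image) step.
    """
    return "video" if rng.random() < p_video else "image"


def simulate_schedule(num_steps: int, p_video: float = 0.2, seed: int = 0) -> Counter:
    """Count how often each task is drawn over num_steps independent draws."""
    rng = random.Random(seed)
    return Counter(choose_task(rng, p_video) for _ in range(num_steps))


counts = simulate_schedule(10_000)
print(counts)  # the video share should land near p_video
```

Drawing the task independently at each step, rather than alternating in fixed blocks, means neither objective goes unvisited for long, which is one plausible reading of why a small video set can keep motion behaviour from drifting.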

If this is right

The training recipe transfers directly to other pretrained video models such as Wan 2.2-5B.
Subject fidelity and motion quality remain competitive with prior zero-shot methods that required orders of magnitude more compute and paired data.
No per-subject tuning is needed at test time.
Large-scale datasets of subject-video pairs are unnecessary for supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Lowering the resource barrier could make personalized video generation practical for smaller labs or consumer tools.
Similar decomposition into identity and motion objectives may apply to other conditional generation settings such as image or audio synthesis.
Scaling the approach to even larger base models could produce further efficiency gains or quality improvements.

Load-bearing premise

Identity injection from subject images and motion awareness from arbitrary videos can be jointly optimized via stochastic switching without any subject-video pairs because the objectives evolve to nearly orthogonal update subspaces.

What would settle it

Training the same model without stochastic switching produces either collapsed motion quality or a sharp drop in subject fidelity relative to the reported baselines.

Figures

Figures reproduced from arXiv: 2504.17816 by Chong Luo, Daneul Kim, Jaesik Park, Jingxu Zhang, Qi Dai, Sunghyun Cho, Wonjoon Jin.

**Figure 1.** Main results. Our method produces high-quality subject-driven video generation (SDV-Gen) in a zero-shot manner. We extend video generative models to have SDV-Gen capability using small datasets. For the first two rows, we fine-tune CogVideoX-5B with a set of 4,000 unpaired videos and a set of 200K paired subject-images. For the result in the last row, we use only 4,000 unpaired videos and 4,000 paired subject-imag…

**Figure 2.** Dual-task learning strategy. We formulate subject-driven video generation as a dual-task problem. The first task is identity injection (bottom) from paired subject-images; the second is motion-awareness preservation (left), for which we use unpaired videos and stochastically switched learning.

**Figure 3.** Training and inference details. Left: During training, we stochastically alternate between two objectives: identity injection using paired subject-images and motion-awareness preservation using a small set of unpaired videos. Right: At inference time, no additional per-subject tuning is required. The model generates a video conditioned on the reference image and text prompt in a zero-shot manner.

**Figure 4.** Limitation of the SDI-Gen→I2V method. When the subject appears small in the first frame, I2V fails to generate consistent results because it cannot interpret low-resolution subjects.

**Figure 5.** Gradient analysis on alignment and norms during fine-tuning. Left: Cosine similarity ϕ(t) between g_img and g_vid (over trainable parameters) quickly collapses to a narrow band near zero under dual-task training, indicating emergent near-orthogonality. Right: ℓ2 norms ∥g_img(t)∥2 and ∥g_vid(t)∥2 remain non-negligible and similar in scale after 100 steps.
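
The alignment measure in the Figure 5 caption is a cosine similarity between two flattened gradient vectors. A minimal sketch of that quantity follows, using plain Python lists in place of real model gradients; the function names and the toy vectors are assumptions for illustration, not the authors' code.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_norm(a: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(g_img: Sequence[float], g_vid: Sequence[float],
                      eps: float = 1e-12) -> float:
    """Cosine of the angle between two gradients; eps guards against zero norms."""
    denom = l2_norm(g_img) * l2_norm(g_vid)
    return dot(g_img, g_vid) / max(denom, eps)


# Toy gradients that touch disjoint coordinates, so they are exactly orthogonal.
g_img = [1.0, 0.0, 2.0]
g_vid = [0.0, 3.0, 0.0]
phi = cosine_similarity(g_img, g_vid)
print(phi, l2_norm(g_img), l2_norm(g_vid))
```

A value near zero only signals near-orthogonality when both norms are non-negligible, which is why the caption reports the norms alongside the similarity.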

**Figure 6.** Qualitative comparison with zero-shot methods (left) and per-subject tuning methods (right). "Ours mini" denotes our model fine-tuned with a 4,000-pair subset of Subject-200K [44]. Note that ours is zero-shot and tuning-free, requiring no tuning at inference time. We add qualitative comparison with Still-Moving [7] and CustomCrafter [53], using the results reported in their paper and supplements due to unavailable cod…

**Figure 7.** Qualitative comparison with PCGrad. Our method demonstrates better detail, ID consistency, and motion-awareness than tuning-free baselines Vidu 2.0 and VideoBooth, and is competitive with Phantom and VACE with Wan. For the backpack example, our model faithfully reproduces fine textures and foreground motion, whereas Vidu 2.0 preserves ID but exhibits jittery trajectories, and VideoBooth yields the weakest …

**Figure 9.** Gradient alignment between image and video batches during image-only fine-tuning, and gradient norms of the two objectives. Gradient alignment and norm are recorded every 50 steps. 2. Capture the flat gradient on each rank before the framework performs all-reduce. 3. Reconstruct the global gradient by reducing the sum of the flat buffers across ranks and dividing by the world size (equivalent to an average pre-sync…
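
The reconstruction step quoted in the Figure 9 text (sum the per-rank flat buffers, then divide by the world size) can be sketched without any distributed framework. The example below simulates ranks as plain Python lists; `average_flat_gradients` and the toy buffers are hypothetical names and values, and a real setup would use a collective operation instead.

```python
from typing import List, Sequence


def average_flat_gradients(per_rank: Sequence[Sequence[float]]) -> List[float]:
    """Sum equally sized flat gradient buffers across ranks, then divide by world size."""
    world_size = len(per_rank)
    if world_size == 0:
        raise ValueError("need at least one rank")
    length = len(per_rank[0])
    if any(len(buf) != length for buf in per_rank):
        raise ValueError("all flat buffers must have the same length")
    summed = [sum(values) for values in zip(*per_rank)]
    return [s / world_size for s in summed]


# Four simulated ranks, each holding a two-element flat gradient.
rank_buffers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
global_grad = average_flat_gradients(rank_buffers)
print(global_grad)
```

Averaging the captured buffers by hand gives the same vector a synchronized average would produce, while leaving the per-rank gradients available for separate per-task analysis.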

**Figure 10.** Gradient alignment comparison of the two objectives with and without PCGrad. The red and orange lines indicate the alignment with PCGrad at different buffer sizes, and the blue dashed line indicates the alignment without PCGrad, equivalent to
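
For readers unfamiliar with PCGrad, the sketch below shows the projection rule as it is commonly described: when two task gradients conflict (negative inner product), one is projected onto the normal plane of the other. This is a generic illustration with hypothetical names; it does not model the "buffer size" variant mentioned in the caption, whose details are not given here.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def pcgrad_project(g_i: Sequence[float], g_j: Sequence[float]) -> List[float]:
    """Remove from g_i its component along g_j, but only if the two conflict."""
    d = dot(g_i, g_j)
    if d >= 0.0:
        return list(g_i)  # no conflict: leave g_i unchanged
    scale = d / dot(g_j, g_j)  # d < 0 implies g_j is nonzero
    return [a - scale * b for a, b in zip(g_i, g_j)]


g_i = [1.0, 0.0]
g_j = [-1.0, 1.0]
projected = pcgrad_project(g_i, g_j)
print(projected, dot(projected, g_j))
```

If the two objectives are already close to orthogonal, the conflict branch rarely fires with a large correction, so such a projection would be expected to change little.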

**Figure 11.** Ablation on varying S&I set size used for training. With subsets down to 4K the model preserves identity, but when the subset is reduced to 2K (≈ 1% of Subject-200K) it fails from time to time. Panel labels: Ours; w/o random image drop; w/o random initial frame. Prompts: “A close up view. A bowl of oranges placed on a wooden table. The background is a dark room, the TV is on, and the screen is showing a cooking show.”; "On th…

**Figure 13.** Qualitative result on ablation study of our component in temporal awareness preservation.

**Figure 14.** Additional qualitative comparison with Still-Moving [7]. Note that "mini" denotes our method trained with a 4K subset of Subject-200K [44].

**Figure 15.** Additional qualitative comparison with CustomCrafter [53]. Note that "mini" denotes our method trained with a 4K subset of Subject-200K [44].

**Figure 16.** Additional qualitative comparison with tuning-free baselines including VACE Wan-1.3B [27] and Phantom Wan-1.3B [33]. Note that "mini" denotes our method trained with a 4K subset of Subject-200K [44].

**Figure 17.** Additional qualitative comparison with tuning-free baselines including VACE Wan-1.3B [27] and Phantom Wan-1.3B [33]. Note that "mini" denotes our method trained with a 4K subset of Subject-200K [44].

Original abstract

Subject-driven video generation (SDV-Gen) aims to produce videos of a specific subject by adapting a pretrained video model, enabling personalized and application-driven content creation. To achieve this goal, per-subject tuning methods require approximately 200 A100 GPU hours to generate a customized video, whereas zero-shot methods avoid per-subject tuning but typically rely on millions of subject-video pairs for the supervision, incurring massive network fine-tuning costs (10K-200K A100 GPU hours). We propose a data- and compute-efficient zero-shot SDV-Gen framework that avoids test-time per-subject tuning and the use of large-scale subject-video pairs. Our key idea decomposes SDV-Gen into (i) identity injection learned from subject-image pairs and (ii) motion-awareness preservation maintained by a small set of arbitrary videos. We optimize the two tasks with stochastic switching, using random reference-frame sampling and image-token dropout to prevent trivial first-frame copying. Our gradient analysis shows that the two objectives rapidly evolve toward nearly orthogonal update subspaces, explaining the stable optimization. Using CogVideoX-5B, we adapt a single model with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours. This yields about 1% of compute compared to prior zero-shot baselines (i.e., 0.4% of VACE and 2.8% of Phantom) while using no subject-video pairs, yet remaining competitive in subject fidelity and motion quality. We show that the same recipe transfers to Wan 2.2-5B.
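
The percentages in the abstract can be turned back into rough baseline budgets as a sanity check. The sketch below does that arithmetic; the function name is hypothetical, and because the quoted percentages are rounded, the implied hours are approximations rather than figures reported by the authors.

```python
OURS_GPU_HOURS = 288
# Share of each baseline's compute, in percent, as quoted in the abstract.
SHARE_OF_BASELINE_PCT = {"VACE": 0.4, "Phantom": 2.8}


def implied_baseline_hours(ours_hours: float, percent: float) -> float:
    """Back out the baseline budget from 'ours is <percent>% of the baseline'."""
    return ours_hours * 100.0 / percent


implied = {
    name: implied_baseline_hours(OURS_GPU_HOURS, pct)
    for name, pct in SHARE_OF_BASELINE_PCT.items()
}
for name, hours in implied.items():
    print(f"{name}: about {hours:,.0f} A100 GPU hours")
```

Under these rounded inputs the implied budgets are on the order of 72,000 and 10,300 A100 GPU hours, both inside the 10K-200K range the abstract gives for zero-shot baselines.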

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a zero-shot subject-driven video generation (SDV-Gen) framework that decomposes the task into identity injection learned from 200K subject-image pairs and motion-awareness preservation from 4,000 arbitrary videos. These are jointly optimized on a pretrained model (CogVideoX-5B) via stochastic switching combined with random reference-frame sampling and image-token dropout, without requiring subject-video pairs. Gradient analysis is invoked to show that the objectives evolve toward nearly orthogonal update subspaces, enabling stable training in 288 A100 GPU hours (claimed ~1% of prior zero-shot baselines) while remaining competitive in subject fidelity and motion quality. The recipe is reported to transfer to Wan 2.2-5B.

Significance. If the efficiency and competitiveness claims are substantiated, the work would represent a substantial advance in lowering the data and compute barriers for personalized video generation, potentially making subject-driven methods more accessible. The decomposition strategy and gradient-based justification for avoiding paired data constitute a technically interesting approach to multi-objective fine-tuning in diffusion models.

major comments (3)

[Gradient analysis] Gradient analysis section: The claim that the identity and motion objectives 'rapidly evolve toward nearly orthogonal update subspaces' is load-bearing for the central argument that stochastic switching suffices without subject-video pairs. The manuscript should provide quantitative support such as plots of gradient cosine similarity over training steps, variance across runs, or an ablation measuring motion coherence degradation when switching is removed or when identity updates dominate.
[Experimental results] Experimental results section: The abstract and claims assert competitive subject fidelity and motion quality at 1% compute, yet no quantitative tables, metrics (e.g., subject similarity scores, motion quality metrics with error bars), or direct comparisons to VACE and Phantom are referenced in the provided description. This weakens the ability to evaluate the 1% compute advantage and competitiveness.
[Training procedure] Training procedure section: The stochastic switching mechanism with random reference-frame sampling and image-token dropout is presented as preventing trivial copying, but the manuscript should include an ablation isolating the contribution of each component to final performance and to the observed orthogonality.
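
To make the two anti-copying components in the last comment concrete, here is a minimal sketch of what each might look like in isolation, which is also the granularity an ablation would toggle. Everything in it is an assumption for illustration: the function names, the per-token masking scheme, the 49-frame clip length, and the 0.5 drop probability are not taken from the paper.

```python
import random
from typing import List


def sample_reference_frame(num_frames: int, rng: random.Random) -> int:
    """Return a uniformly random frame index instead of always using frame 0."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return rng.randrange(num_frames)


def drop_image_tokens(tokens: List[float], drop_prob: float,
                      rng: random.Random, mask_value: float = 0.0) -> List[float]:
    """Independently replace each token with mask_value with probability drop_prob."""
    return [mask_value if rng.random() < drop_prob else tok for tok in tokens]


rng = random.Random(0)
ref_index = sample_reference_frame(num_frames=49, rng=rng)  # hypothetical clip length
kept = drop_image_tokens([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, rng=rng)
print(ref_index, kept)
```

Setting `drop_prob` to 0.0 or fixing the reference index to 0 disables one component at a time, which is the kind of switch a component-wise ablation needs.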

minor comments (2)

[Evaluation metrics] Clarify the exact definition of 'subject fidelity' and 'motion quality' metrics used for competitiveness claims, including any human evaluation protocols.
[Dataset description] Ensure all dataset sizes (200K subject-image pairs, 4,000 arbitrary videos) are consistently reported with details on curation and diversity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, outlining how we will strengthen the manuscript while preserving its core contributions on data-efficient zero-shot subject-driven video generation.

Referee: [Gradient analysis] Gradient analysis section: The claim that the identity and motion objectives 'rapidly evolve toward nearly orthogonal update subspaces' is load-bearing for the central argument that stochastic switching suffices without subject-video pairs. The manuscript should provide quantitative support such as plots of gradient cosine similarity over training steps, variance across runs, or an ablation measuring motion coherence degradation when switching is removed or when identity updates dominate.

Authors: We appreciate the referee's focus on making the gradient analysis more rigorous. The manuscript already presents a gradient analysis showing that the objectives evolve toward nearly orthogonal subspaces, which underpins the stability of stochastic switching. To provide the requested quantitative support, we will add plots of gradient cosine similarity over training steps (with variance across runs) and an ablation measuring motion coherence degradation when switching is removed or identity updates dominate. These additions will be included in the revised gradient analysis section.
Revision: yes

Referee: [Experimental results] Experimental results section: The abstract and claims assert competitive subject fidelity and motion quality at 1% compute, yet no quantitative tables, metrics (e.g., subject similarity scores, motion quality metrics with error bars), or direct comparisons to VACE and Phantom are referenced in the provided description. This weakens the ability to evaluate the 1% compute advantage and competitiveness.

Authors: We acknowledge that the experimental claims would be stronger with more explicit quantitative backing in the main text. The manuscript reports competitiveness in subject fidelity and motion quality at 1% compute relative to VACE and Phantom, but we will expand the experimental results section to include dedicated tables with subject similarity scores, motion quality metrics (including error bars), and direct numerical comparisons to VACE and Phantom. These will be clearly referenced from the abstract and introduction in the revision.
Revision: yes

Referee: [Training procedure] Training procedure section: The stochastic switching mechanism with random reference-frame sampling and image-token dropout is presented as preventing trivial copying, but the manuscript should include an ablation isolating the contribution of each component to final performance and to the observed orthogonality.

Authors: We agree that component-wise ablations would clarify the design choices. The current training procedure uses stochastic switching combined with random reference-frame sampling and image-token dropout to avoid trivial copying while promoting orthogonality. In the revision, we will add an ablation study that isolates the contribution of each element (switching, reference-frame sampling, and token dropout) to both final performance metrics and the observed gradient orthogonality.
Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation rests on empirical gradient analysis and standard training decomposition, not definitional reduction

The paper's central claim decomposes SDV-Gen into identity injection (from subject-image pairs) and motion preservation (from arbitrary videos) optimized via stochastic switching. The load-bearing justification is the reported gradient analysis showing rapid evolution to nearly orthogonal update subspaces, which is presented as an empirical observation rather than a mathematical identity or fitted parameter renamed as prediction. No equations reduce the orthogonality result to the input data by construction, no self-citation chain is invoked to justify uniqueness, and no ansatz is smuggled via prior work. The 288 A100-hour training outcome and competitive fidelity claims follow from the described procedure without tautological equivalence to the inputs. This is a standard empirical method paper whose performance claims are externally falsifiable via replication on the stated datasets and model.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the separability of identity and motion learning plus the empirical observation of orthogonal gradients; the chosen dataset sizes function as free parameters selected for the reported outcome.

free parameters (2)

Number of subject-image pairs = 200K
200K pairs selected to achieve reported subject fidelity under the compute budget.
Number of arbitrary videos = 4,000
4,000 videos chosen as a small set sufficient for motion awareness.

axioms (2)

Domain assumption: Subject identity can be effectively learned from image pairs alone while motion awareness is maintained by arbitrary videos.
Core premise of the two-task decomposition stated in the abstract.
Domain assumption: The identity and motion objectives evolve toward nearly orthogonal update subspaces during joint optimization.
Invoked to explain stable training; supported by the paper's gradient analysis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Our gradient analysis shows that the two objectives rapidly evolve toward nearly orthogonal update subspaces, explaining the stable optimization... stochastic task-switching... p=0.2"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "decomposes SDV-Gen into (i) identity injection... (ii) motion-awareness preservation... 288 A100 GPU hours... 1% of compute"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 11 internal anchors

[1] Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, and Gal Chechik. Multi-shot character consistency for text-to-video generation. arXiv:2412.07750, 2024.
[2] Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text-to-image diffusion models. arXiv:2311.10093, 2023.
[3] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. arXiv:2401.12945, 2024.
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127, 2023.
[5] Kelvin C.K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, and Huisheng Wang. Improving subject-driven image synthesis with subject-agnostic guidance. In CVPR, 2024.
[6] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. In International Conference on Learning Representations, 2019.
[7] Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, and Inbar Mosseri. Still-moving: Customized video generation without customized video data. arXiv:2407.08674, 2024.
[8] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024.
[9] Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In ICLR, 2024.
[10] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv:2310.00426, 2023.
[11] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. arXiv:2501.06187, 2025.
[12] Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom-data: Towards a general subject-consistent video generation dataset. arXiv:2506.18851, 2025.
[13] Ganggui Ding et al. Freecustom: Tuning-free customized image generation for multi-concept composition. arXiv:2405.13870, 2024.
[14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[15] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
[16] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv:2208.01618, 2022.
[17] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion. arXiv:2501.00103, 2024.
[18] Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id-animator: Zero-shot identity-preserving human video generation. arXiv:2404.15275, 2024.
[19] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. 2022.
[20] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv:2205.15868, 2022.
[21] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
[22] Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation, 2025.
[23] Zhihao Hu and Dong Xu. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv:2307.14073, 2023.
[24] Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Concept-master: Multi-concept video customization on diffusion transformer models without test-time tuning. arXiv:2501.04698, 2025.
[25] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.
[26] Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, and Ziwei Liu. Videobooth: Diffusion-based video generation with image prompts. In CVPR, 2024.
[27] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. arXiv:2503.07598, 2025.
[28] Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. Flovd: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. arXiv:2502.08244, 2025.
[29] jovianzm. Pexels-400k. https://huggingface.co/datasets/jovianzm/Pexels-400k, 2025. Accessed: 2025-03-07.
[30] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.
[31]

Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. InNeurIPS, 2023. 2, 6

work page 2023
[32]

Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, and Renjie Liao. Fréchet video motion distance: A metric for evaluating motion consistency in videos. arXiv:2407.16124.

[33]

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. arXiv:2502.11079, 2025.

[34]

Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Customizable image synthesis with multiple subjects. In Advances in neural information processing systems, 2023.

[35]

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in neural information processing systems, pages 6467–6476, 2017.
[36]

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[37]

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In CVPR, 2023.

[38]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

[39]

Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
[40]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752, 2021.

[41]

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.

[42]

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.

[43]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv:2104.09864, 2021.
[44]

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv:2411.15098, 2024.

[45]

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. arXiv:2003.12039, 2020.

[46]

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. TOG, 2024.

[47]

Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seeding infinity in diffusion transformer towards generic video restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2161–2172, 2025.
[48]

Xiaowei Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. arXiv:2406.07209, 2024.

[49]

WanTeam et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[50]

Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. In CVPR, 2024.

[51]

Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control. arXiv:2410.13830, 2024.
[52]

Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, and Kai Chen. Motionbooth: Motion-aware customized text-to-video generation. arXiv:2406.17758, 2024.

[53]

Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, and Xi Li. Customcrafter: Customized video generation with preserving motion and concept composition abilities. In AAAI, 2025.

[54]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv:2408.06072.

[55]

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv:2308.06721, 2023.
[56]

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems, 2020.

[57]

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. arXiv:2411.17440, 2024.

[58]

Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025.

[59]

Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, and Yogesh Balaji. Jedi: Joint-image diffusion models for finetuning-free personalized text-to-image generation. In CVPR, 2024.

[60]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[61]

Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In CVPR, 2024.

[62]

Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. Magic mirror: Id-preserved video generation in video diffusion transformers. In ICCV, 2025.

Subject-driven Video Generation via Disentangled Identity and Motion

Supplementary Material

Note: We use green color to refer to figures, tables in the main manuscript (e.g.,...
[63]

Register backward hooks on modules containing $\theta_{\mathrm{train}}$ to copy per-parameter .grad into a preallocated flat buffer in a fixed parameter order.

[Figure 9 plot: panels "Gradient Alignment" and "Grad Norm" versus Step (0 to 1000); series: Image, Video.]

Figure 9. Gradient alignment between image and video batches during image-only finetuni...
[64]

Capture the flat gradient on each rank before the framework performs all-reduce

[65]

Reconstruct the global gradient by reducing the sum of the flat buffers across ranks and dividing by the world size (equivalent to an average pre-sync gradient at the current step)

[66]

Move the aggregated flat gradient to the CPU for logging to minimize device memory pressure. When gradient accumulation is used, we first accumulate local micro-batches, then capture the pre-sync aggregate.

Flattening and Layer-wise Grouping. Let $\tilde{g}_{\mathrm{img}}(t)$ and $\tilde{g}_{\mathrm{vid}}(t)$ be the aggregated flat vectors formed by concatenating per-parameter gradients from $\theta$ ...
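The capture steps above can be illustrated with a small, framework-free sketch. It assumes that the "gradient alignment" plotted in Figure 9 is the cosine similarity between the image-batch and video-batch flat gradients; the text here is truncated, so that choice of metric, the function names, and the toy values are all assumptions rather than the authors' implementation.

```python
import math

def average_across_ranks(per_rank_grads):
    """Sum the per-rank flat buffers and divide by the world size."""
    world_size = len(per_rank_grads)
    return [sum(vals) / world_size for vals in zip(*per_rank_grads)]

def cosine_alignment(g_a, g_b, eps=1e-12):
    """Cosine similarity between two flat gradient vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    return dot / (norm_a * norm_b + eps)

# Toy data: two ranks, three trainable parameters (values are made up).
g_img = average_across_ranks([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]])
g_vid = average_across_ranks([[-1.0, 1.0, -1.0], [-1.0, -1.0, -1.0]])
alignment = cosine_alignment(g_img, g_vid)
print(g_img, g_vid, round(alignment, 6))
```

In this toy case the two averaged gradients point in opposite directions, so the alignment is close to -1; values near 0 would indicate roughly orthogonal updates.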
[67]

Foreground–Background Segmentation. For each video, we use an off-the-shelf segmentation model (e.g., Grounded-SAM2) on the first frame to separate foreground and background regions. This allows us to measure object (foreground) motion independently from any camera-induced background shifts.
[68]

Optical Flow Computation. We estimate optical flow between the first frame and each subsequent frame using a standard flow estimator (e.g., RAFT [45]). Let $u_f(x)$ and $u_b(x)$ denote the per-pixel flow vectors for the foreground and background pixels, respectively, at position $x$. We record:

$$\mathrm{FlowMag}_f = \frac{1}{N_f} \sum_{x \in \mathrm{fg}} \lVert u_f(x) \rVert, \qquad \mathrm{FlowMag}_b = \frac{1}{N_b} \sum_{x \in \mathrm{bg}} \lVert u_b(x) \rVert,$$

where $N_f$...
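As a worked instance of the two averages above, the following sketch computes the mean flow-vector norm over foreground and background pixels for a single frame pair. It assumes $N_f$ and $N_b$ are the foreground and background pixel counts (the definition is truncated in the text); the function name, the nested-list data layout, and the toy values are hypothetical.

```python
import math

def flow_magnitudes(flow, fg_mask):
    """Return (FlowMag_f, FlowMag_b) for one frame pair.

    flow:    H x W grid of (dx, dy) flow vectors in pixels.
    fg_mask: H x W grid of booleans, True where the pixel is foreground.
    """
    fg_sum = bg_sum = 0.0
    n_fg = n_bg = 0
    for flow_row, mask_row in zip(flow, fg_mask):
        for (dx, dy), is_fg in zip(flow_row, mask_row):
            magnitude = math.hypot(dx, dy)  # Euclidean norm of the flow vector
            if is_fg:
                fg_sum += magnitude
                n_fg += 1
            else:
                bg_sum += magnitude
                n_bg += 1
    # Guard against empty regions (assumed convention: report 0.0).
    mag_f = fg_sum / n_fg if n_fg else 0.0
    mag_b = bg_sum / n_bg if n_bg else 0.0
    return mag_f, mag_b

# Toy 2x2 frame: left column is foreground, right column is background.
flow = [[(3.0, 4.0), (0.0, 0.0)],
        [(6.0, 8.0), (0.0, 2.0)]]
fg_mask = [[True, False],
           [True, False]]
mag_f, mag_b = flow_magnitudes(flow, fg_mask)
print(mag_f, mag_b)
```

In practice the flow field would come from an estimator such as RAFT and the mask from the first-frame segmentation; both are replaced by hand-written values here.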
[69]

Dataset Filtering. To ensure negligible camera motion, we discard any video whose average background flow magnitude $\mathrm{FlowMag}_b$ exceeds 10 pixels. This filtering step excludes scenes with significant global shifts, retaining only those with primarily object-centric motion.
[70]

Category Assignment. Based on the average magnitude of foreground flow $\mathrm{FlowMag}_f$ (averaged over all frames), we categorize videos into:

• Small: $0 \le \mathrm{FlowMag}_f \le 25$
• Medium: $25 < \mathrm{FlowMag}_f \le 50$
• Large: $\mathrm{FlowMag}_f > 50$

Each category contains 300 videos, ensuring a balanced evaluation of low-, moderate-, and high-motion scenarios.
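The filtering rule and the three motion categories can be sketched together as follows. The thresholds (10, 25, 50 pixels) come from the text; the function names, the dictionary layout, and the clip values are hypothetical, and the sketch does not reproduce how the 300 videos per category were selected.

```python
BG_FLOW_MAX = 10.0  # pixels; background-flow threshold stated in the text

def motion_category(flow_mag_f):
    """Map an average foreground flow magnitude to small / medium / large."""
    if flow_mag_f < 0:
        raise ValueError("flow magnitude must be non-negative")
    if flow_mag_f <= 25:
        return "small"
    if flow_mag_f <= 50:
        return "medium"
    return "large"

def filter_and_bucket(videos):
    """videos: dict name -> (FlowMag_f, FlowMag_b), both averaged over frames."""
    buckets = {"small": [], "medium": [], "large": []}
    for name, (mag_f, mag_b) in videos.items():
        if mag_b > BG_FLOW_MAX:  # discard clips with noticeable camera motion
            continue
        buckets[motion_category(mag_f)].append(name)
    return buckets

# Made-up clips: (foreground magnitude, background magnitude).
videos = {
    "clip_a": (12.0, 1.5),
    "clip_b": (25.0, 10.0),   # boundary values are kept and counted as small
    "clip_c": (40.0, 3.0),
    "clip_d": (80.0, 22.0),   # dropped: background flow exceeds 10 pixels
    "clip_e": (50.1, 0.0),
}
buckets = filter_and_bucket(videos)
print(buckets)
```

Filtering before bucketing keeps camera-motion clips out of every category, so the foreground magnitude alone determines the label.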
[71]

Evaluation Protocol. Within each subset, we use only the first frame (including any textual or reference cues, if required) to generate a video of the same length. We then compute FVD [32] between the generated outputs and the ground-truth videos. By comparing FVD across small, medium, and large motion classes, we obtain a clearer picture of how each model...

[1]

Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, and Gal Chechik. Multi-shot character consistency for text-to-video generation. arXiv:2412.07750, 2024.

[2]

Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text-to-image diffusion models. arXiv:2311.10093, 2023.

[3]

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.

[4]

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127, 2023.

[5]

Kelvin C.K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, and Huisheng Wang. Improving subject-driven image synthesis with subject-agnostic guidance. In CVPR, 2024.

[6]

Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. In International Conference on Learning Representations, 2019.

[7]

Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, and Inbar Mosseri. Still-moving: Customized video generation without customized video data. arXiv:2407.08674, 2024.

[8]

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024.

[9]

Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In ICLR, 2024.

[10]

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv:2310.00426, 2023.

[11]

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. arXiv:2501.06187, 2025.

[12]

Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom-data: Towards a general subject-consistent video generation dataset. arXiv preprint arXiv:2506.18851, 2025.

[13]

Ganggui Ding et al. Freecustom: Tuning-free customized image generation for multi-concept composition. arXiv:2405.13870, 2024.

[14]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

[15]

Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.

[16]

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

[17]

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

[18]

Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id-animator: Zero-shot identity-preserving human video generation. arXiv:2404.15275, 2024.

[19]

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. 2022.

[20]

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv:2205.15868, 2022.

[21]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022.

[22]

Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation, 2025.

[23]

Zhihao Hu and Dong Xu. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv:2307.14073, 2023.

[24]

Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Concept-master: Multi-concept video customization on diffusion transformer models without test-time tuning. arXiv:2501.04698, 2025.

[25]

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.

[26]

Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, and Ziwei Liu. Videobooth: Diffusion-based video generation with image prompts. In CVPR, 2024.

[27] [27]

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv:2503.07598, 2025. 2, 3, 6, 21, 24, 25

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

Flovd: Optical flow meets video diffu- sion model for enhanced camera-controlled video synthesis

Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. Flovd: Optical flow meets video diffu- sion model for enhanced camera-controlled video synthesis. arXiv:2502.08244, 2025. 19, 20

work page arXiv 2025

[29] [29]

Pexels-400k.https : / / huggingface

jovianzm. Pexels-400k.https : / / huggingface . co/datasets/jovianzm/Pexels-400k, 2025. Ac- cessed: 2025-03-07. 2, 3, 5, 18, 20

work page 2025

[30] [30]

Multi-concept customization of text- to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shecht- man, and Jun-Yan Zhu. Multi-concept customization of text- to-image diffusion. InCVPR, 2023. 3

work page 2023

[31] [31]

Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. InNeurIPS, 2023. 2, 6

work page 2023

[32] [32]

Fr´echet video motion distance: A metric for evaluating motion consistency in videos.arXiv:2407.16124,

Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, and Renjie Liao. Fr´echet video motion distance: A metric for evaluating motion consistency in videos.arXiv:2407.16124,

work page arXiv

[33] [33]

Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei

Lijie Liu, Tianxaing Ma, Bingchuan Li, Zhuowei Chen, Ji- awei Liu, Qian He, and Xinglong Wu. Phantom: Subject- consistent video generation via cross-modal alignment. arXiv:2502.11079, 2025. 2, 3, 6, 21, 24, 25

work page arXiv 2025

[34] [34]

Customizable image synthesis with multiple subjects

Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Customizable image synthesis with multiple subjects. InAdvances in neural information processing systems, 2023. 3

work page 2023

[35] [35]

Gradient episodic memory for continual learning

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. InAdvances in neu- ral information processing systems, pages 6467–6476, 2017. 3

work page 2017

[36] [36]

Human-level control through deep reinforcement learn- ing.Nature, 518(7540):529–533, 2015

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, An- drei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learn- ing.Nature, 518(7540):529–533, 2015. 3

work page 2015

[37] [37]

T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models. InCVPR, 2023. 3

work page 2023

[38] [38]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. 3

work page 2023

[39] [39]

Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123–146, 1995

Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123–146, 1995. 3

work page 1995

[40] [40]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models.arXiv:2112.10752, 2021. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2021

[41] [41]

Dreambooth: Fine- tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine- tuning text-to-image diffusion models for subject-driven generation. InCVPR, 2023. 2, 3, 7

work page 2023

[42] [42]

Continual learning with deep generative replay.Advances in neural information processing systems, 30, 2017

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay.Advances in neural information processing systems, 30, 2017. 3

work page 2017

[43] [43]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv:2104.09864, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021

[44] [44]

OminiControl: Minimal and Universal Control for Diffusion Transformer,

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer.arXiv:2411.15098, 2024. 2, 3, 4, 5, 6, 7, 8, 17, 21, 22, 23, 24, 25

work page arXiv 2024

[45] [45]

Raft: Recurrent all-pairs field transforms for optical flow.arXiv:2003.12039, 2020

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow.arXiv:2003.12039, 2020. 20

work page arXiv 2003

[46] [46]

Training-free con- sistent text-to-image generation.TOG, 2024

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free con- sistent text-to-image generation.TOG, 2024. 3

work page 2024

[47] [47]

Seedvr: Seeding in- finity in diffusion transformer towards generic video restora- tion

Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seeding in- finity in diffusion transformer towards generic video restora- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 2161– 2172, 2025. 2

work page 2025

[48] [48]

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al

Xiaowei Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image per- sonalization with layout guidance.arXiv:2406.07209, 2024. 3

work page arXiv 2024

[49] [49]

Wan: Open and Advanced Large-Scale Video Generative Models

WanTeam et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Dreamvideo: Composing your dream videos with customized subject and motion

Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhi- heng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hong- ming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. InCVPR, 2024. 2, 3

work page 2024

[51] [51]

Dreamvideo-2: Zero-shot subject- driven video customization with precise motion control

Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Hao- nan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject- driven video customization with precise motion control. arXiv:2410.13830, 2024. 2, 3

work page arXiv 2024

[52] [52]

Mo- tionbooth: Motion-aware customized text-to-video genera- tion.arXiv:2406.17758, 2024

Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, and Kai Chen. Mo- tionbooth: Motion-aware customized text-to-video genera- tion.arXiv:2406.17758, 2024. 3

work page arXiv 2024

[53] [53]

Custom- crafter: Customized video generation with preserving motion and concept composition abilities

Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guang- cong Zheng, Zhongang Qi, Ying Shan, and Xi Li. Custom- crafter: Customized video generation with preserving motion and concept composition abilities. InAAAI, 2025. 2, 3, 7, 8, 21, 23

work page 2025

[54] [54]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffu- sion models with an expert transformer.arXiv:2408.06072,

work page internal anchor Pith review Pith/arXiv arXiv

[55] [55]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv:2308.06721, 2023. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[56] [56]

Gradient surgery for multi-task learning

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. InAdvances in Neural Information Pro- cessing Systems, 2020. 3, 6, 8

work page 2020

[57] [57]

Identity- preserving text-to-video generation by frequency decompo- sition.arXiv:2411.17440, 2024

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yu- jun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity- preserving text-to-video generation by frequency decompo- sition.arXiv:2411.17440, 2024. 2, 3

work page arXiv 2024

[58] [58]

Opens2v-nexus: A detailed benchmark and million-scale dataset for subject- to-video generation.arXiv preprint arXiv:2505.20292, 2025

Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject- to-video generation.arXiv preprint arXiv:2505.20292, 2025. 3, 20, 21

work page arXiv 2025

[59] [59]

Patel, Haochen Wang, Xun Huang, Ting- Chun Wang, Ming-Yu Liu, and Yogesh Balaji

Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting- Chun Wang, Ming-Yu Liu, and Yogesh Balaji. Jedi: Joint- image diffusion models for finetuning-free personalized text- to-image generation. InCVPR, 2024. 3

work page 2024

[60] [60]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 3

work page 2023

[61] [61]

Ssr-encoder: Encoding selective subject representation for subject-driven generation

Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. InCVPR, 2024. 3

work page 2024

[62] [62]

Magic mirror: Id-preserved video generation in video diffusion transformers

Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. Magic mirror: Id-preserved video generation in video diffusion transformers. InICCV, 2025. 2, 3 11 Subject-driven Video Generation via Disentangled Identity and Motion Supplementary Material Note: We use green color to refer to figures, tables in the main manuscript (e.g.,...

work page 2025

[63] [63]

Register backward hooks on modules containingθ train to copy per-parameter.gradinto a preallocated flat buffer in a fixed parameter order. 0 250 500 750 1000 Step 1.0 0.5 0.0 0.5 1.0 Gradient Alignment 0 250 500 750 1000 Step 0.0 0.5 1.0 1.5 2.0 2.5Grad Norm Image Video Figure 9.Gradient alignment between image and video batches during image-only finetuni...

work page

Capture the flat gradient on each rank before the framework performs all-reduce.

Reconstruct the global gradient by summing the flat buffers across ranks and dividing by the world size (equivalent to the average pre-sync gradient at the current step).

Move the aggregated flat gradient to the CPU for logging to minimize device memory pressure.

When gradient accumulation is used, we first accumulate local micro-batches, then capture the pre-sync aggregate.

Flattening and Layer-wise Grouping. Let $\tilde{g}_{\text{img}}(t)$ and $\tilde{g}_{\text{vid}}(t)$ be the aggregated flat vectors formed by concatenating per-parameter gradients from θ ...
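The gradient-capture steps described above (fixed-order flattening, per-rank capture, averaging over the world size) can be illustrated with a small framework-free simulation. This is only a sketch under stated assumptions: ranks are simulated as plain Python dictionaries instead of real distributed processes, the function names `flatten_grads`, `average_across_ranks`, and `cosine_alignment` are hypothetical, and cosine similarity is assumed as the alignment measure because the source text defining it is truncated.

```python
import math


def flatten_grads(grads, order):
    """Copy per-parameter gradients into one preallocated flat buffer.

    `grads` maps a parameter name to a list of floats; `order` fixes the
    parameter order so that buffers from different ranks line up.
    """
    total = sum(len(grads[name]) for name in order)
    flat = [0.0] * total  # preallocated flat buffer
    offset = 0
    for name in order:
        g = grads[name]
        flat[offset:offset + len(g)] = g
        offset += len(g)
    return flat


def average_across_ranks(per_rank_flats):
    """Sum the flat buffers over (simulated) ranks, then divide by world size."""
    world_size = len(per_rank_flats)
    length = len(per_rank_flats[0])
    return [sum(f[i] for f in per_rank_flats) / world_size for i in range(length)]


def cosine_alignment(a, b):
    """Cosine similarity of two flat gradients (ASSUMED alignment measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data: two simulated ranks, two parameters ("w" has 2 entries, "b" has 1).
order = ["w", "b"]
img_ranks = [{"w": [1.0, 0.0], "b": [0.0]}, {"w": [3.0, 0.0], "b": [0.0]}]
vid_ranks = [{"w": [0.0, 2.0], "b": [0.0]}, {"w": [0.0, 4.0], "b": [0.0]}]

g_img = average_across_ranks([flatten_grads(g, order) for g in img_ranks])
g_vid = average_across_ranks([flatten_grads(g, order) for g in vid_ranks])
alignment = cosine_alignment(g_img, g_vid)
```

In this toy case the image and video gradients are orthogonal, so the assumed alignment score is zero. In a real training run the per-rank buffers would be filled by backward hooks before gradient synchronization, which this simulation does not attempt to reproduce.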

Foreground–Background Segmentation. For each video, we use an off-the-shelf segmentation model (e.g., Grounded-SAM2) on the first frame to separate foreground and background regions. This allows us to measure object (foreground) motion independently from any camera-induced background shifts.

Optical Flow Computation. We estimate optical flow between the first frame and each subsequent frame using a standard flow estimator (e.g., RAFT [45]). Let $u_f(x)$ and $u_b(x)$ denote the per-pixel flow vectors for the foreground and background pixels, respectively, at position $x$. We record:

$$\text{FlowMag}_f = \frac{1}{N_f} \sum_{x \in \text{fg}} \lVert u_f(x) \rVert, \qquad \text{FlowMag}_b = \frac{1}{N_b} \sum_{x \in \text{bg}} \lVert u_b(x) \rVert,$$

where $N_f$...
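A minimal sketch of the two flow-magnitude averages for a single frame pair follows. It assumes the flow field is given as a grid of (dx, dy) tuples and the first-frame segmentation as a boolean grid, and it assumes that $N_f$ and $N_b$ are the foreground and background pixel counts (the source definition is truncated). The name `flow_magnitudes` is hypothetical, and no real flow estimator or segmentation model is called.

```python
import math


def flow_magnitudes(flow, fg_mask):
    """Return (FlowMag_f, FlowMag_b) for one frame pair.

    flow:    H x W grid of (dx, dy) flow vectors in pixels.
    fg_mask: H x W grid of booleans, True where the pixel is foreground.
    N_f / N_b are taken to be the pixel counts of each region (assumption).
    """
    fg_sum = bg_sum = 0.0
    n_fg = n_bg = 0
    for flow_row, mask_row in zip(flow, fg_mask):
        for (dx, dy), is_fg in zip(flow_row, mask_row):
            magnitude = math.hypot(dx, dy)  # Euclidean norm of the flow vector
            if is_fg:
                fg_sum += magnitude
                n_fg += 1
            else:
                bg_sum += magnitude
                n_bg += 1
    # Guard against an empty region so the sketch never divides by zero.
    mag_f = fg_sum / n_fg if n_fg else 0.0
    mag_b = bg_sum / n_bg if n_bg else 0.0
    return mag_f, mag_b


# Toy 2 x 2 example: left column is foreground, right column is background.
toy_flow = [[(3.0, 4.0), (0.0, 0.0)],
            [(6.0, 8.0), (0.0, 1.0)]]
toy_mask = [[True, False],
            [True, False]]
toy_mag_f, toy_mag_b = flow_magnitudes(toy_flow, toy_mask)
```

For a whole video, these per-pair values would then be averaged over all frame pairs.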

Dataset Filtering. To ensure negligible camera motion, we discard any video whose average background flow magnitude $\text{FlowMag}_b$ exceeds 10 pixels. This filtering step excludes scenes with significant global shifts, retaining only those with primarily object-centric motion.

Category Assignment. Based on the average magnitude of foreground flow $\text{FlowMag}_f$ (averaged over all frames), we categorize videos into:
• Small: $0 \le \text{FlowMag}_f \le 25$
• Medium: $25 < \text{FlowMag}_f \le 50$
• Large: $\text{FlowMag}_f > 50$
Each category contains 300 videos, ensuring a balanced evaluation of low-, moderate-, and high-motion scenarios.
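The filtering rule and the three motion categories can be combined into one small decision function. The sketch below uses the thresholds stated in the text (10 pixels for background flow; 25 and 50 for foreground flow). The function name `motion_category` and the choice to return `None` for discarded videos are illustrative assumptions.

```python
def motion_category(flow_mag_f, flow_mag_b, bg_threshold=10.0):
    """Assign a motion category from per-video average flow magnitudes.

    Returns None when the video is discarded for excessive camera motion,
    i.e. when the average background flow exceeds `bg_threshold` pixels.
    """
    if flow_mag_b > bg_threshold:
        return None  # filtered out: significant global (camera) shift
    if flow_mag_f <= 25:
        return "small"
    if flow_mag_f <= 50:
        return "medium"
    return "large"


# Toy inputs: (average foreground flow, average background flow) per video.
examples = [(7.5, 0.5), (25.0, 3.0), (40.0, 10.0), (80.0, 2.0), (80.0, 12.0)]
labels = [motion_category(f, b) for f, b in examples]
```

Note that a background magnitude of exactly 10 pixels is kept, since the text discards only videos that exceed the threshold.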

Evaluation Protocol. Within each subset, we use only the first frame (including any textual or reference cues, if required) to generate a video of the same length. We then compute FVD [32] between the generated outputs and the ground-truth videos. By comparing FVD across small, medium, and large motion classes, we obtain a clearer picture of how each model...