pith. machine review for the scientific record.

arxiv: 2604.13841 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

DiffMagicFace: Identity Consistent Facial Editing of Real Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords facial video editing · diffusion models · identity consistency · text-conditioned editing · multi-view datasets · image editing · talking head videos · rendering optimization

The pith

DiffMagicFace produces identity-consistent, text-prompted edits of real facial videos by running two fine-tuned diffusion models concurrently at inference, anchored by per-subject multi-view image datasets generated without any video data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiffMagicFace as a framework for editing facial videos while keeping the same person's identity stable from frame to frame. It combines a text-controlled diffusion model and an image-controlled diffusion model that run concurrently during inference, so that the output follows the edit instruction while preserving identity features. The key step is building, for each person, a collection of images from many angles created through rendering and optimization algorithms; this collection replaces the need for video training data. The method is shown to work on difficult cases such as talking-head sequences and closely related face categories, and the output quality is claimed to match results from conventional rendering pipelines.
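
The Figure 2 caption pins down the mechanics: a text-control model conditioned on the target prompt and a noisy latent z_t, an image-control model conditioned on the source-image latent and z_t, and element-wise addition of the two noise predictions. The sketch below is a minimal reconstruction of that scheme under diffusers-style UNet and scheduler conventions; the function names, the channel concatenation of z_t with the source latent, and the fusion weights are assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def edit_frame(text_unet, image_unet, scheduler, text_emb, src_latent,
               w_text=1.0, w_image=1.0, steps=50):
    """Sketch of one edited frame via concurrent two-model inference.

    `text_unet` and `image_unet` stand in for the paper's fine-tuned
    text-control and image-control latent-diffusion models; the call
    signatures mirror diffusers-style UNets and schedulers and are
    assumptions, not the authors' code.
    """
    z_t = torch.randn_like(src_latent)  # start each frame from noise
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # Text stream: target prompt embedding plus the noisy latent z_t.
        eps_text = text_unet(z_t, t, encoder_hidden_states=text_emb).sample
        # Image stream: source-image latent plus z_t. How the two inputs
        # are joined is not stated; channel concatenation is assumed.
        eps_img = image_unet(torch.cat([z_t, src_latent], dim=1), t).sample
        # Figure 2: the noise predictions are fused by element-wise
        # addition; the caption truncates before naming any weighting,
        # so the scalars w_text / w_image are assumptions.
        eps = w_text * eps_text + w_image * eps_img
        z_t = scheduler.step(eps, t, z_t).prev_sample
    return z_t  # decode with the LDM's VAE to obtain the edited frame
```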

Core claim

The central claim is that concurrent inference with two fine-tuned diffusion models, supported by a per-subject multi-view image dataset produced via rendering and optimization, suffices to generate edited video frames that preserve facial identity across time and follow text semantics, all without training or using any video datasets.

What carries the argument

Concurrent inference of a text-conditioned diffusion model and an image-conditioned diffusion model, anchored by a per-subject multi-view image dataset built through rendering and optimization.

If this is right

  • Edited videos remain consistent frame to frame even when the prompt involves speaking or head motion.
  • The same multi-view dataset works for distinguishing and editing closely related face categories.
  • Output quality reaches parity with videos produced by traditional rendering pipelines.
  • No video-specific training data is required to reach these consistency levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower the barrier to personalized video editing by letting users supply or generate only a small set of rendered views instead of video clips.
  • If the multi-view construction step can be made fast and automatic, the method might generalize to short-form content creation where video corpora are scarce.
  • Similar separation of text and image control streams might apply to non-face domains if analogous view-consistent anchors can be synthesized.
  • Quantitative gains reported against prior video-editing baselines suggest that identity drift in diffusion-based video methods is largely a data-composition problem rather than a fundamental architectural limit.

Load-bearing premise

A collection of multi-view images of the subject, made by rendering and optimization, is enough by itself to keep the edited identity stable across every frame of the output video.

What would settle it

Running the method on a talking-head video of a new subject for which no multi-view dataset was pre-generated and observing clear identity drift or loss of resemblance between early and late frames.
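
That test reduces to comparing face embeddings between an early and a late window of the edited clip. A minimal sketch, assuming only a generic face-embedding function `embed` (for instance an ArcFace-style recognizer; the paper specifies none):

```python
import numpy as np

def identity_drift_score(frames, embed, window=10):
    """Mean cross-window cosine similarity between early and late frames.

    `embed` maps a frame to a 1-D face-recognition embedding; it is a
    placeholder, not something the paper prescribes. A score well below
    the same measurement taken on the *source* video would be the
    'clear identity drift' described above.
    """
    def unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v)
    head = np.stack([unit(embed(f)) for f in frames[:window]])
    tail = np.stack([unit(embed(f)) for f in frames[-window:]])
    return float((head @ tail.T).mean())  # mean pairwise cosine similarity
```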

Figures

Figures reproduced from arXiv: 2604.13841 by Bin Wang, Huanghao Yin, Junhai Yong, Kanle Shi, Shenkun Xu.

Figure 1
Figure 1: Identity-consistent face video editing. Our method achieves face video editing on a specific subject (e.g., half-frame glasses, or a fat facial effect) using a text prompt, while preserving the identity of the source video and maintaining high consistency among frames. • Our approach uniquely creates a paired and edited training dataset without relying on specific video datasets, utilizing advanced renderin… view at source ↗
Figure 2
Figure 2: Inference pipeline. DiffMagicFace consists of two fine-tuned models based on latent diffusion models. During training, we fine-tune a text-control model and an image-control model. When sampling, the text-control model takes the target text and a noisy latent code z_t as input, and the image-control model takes a latent of the source image and z_t as input. Their noise predictions are combined by element-wise addition under … view at source ↗
Figure 3
Figure 3: Comparison of qualitative results with previous editing methods for the editing subject “glasses”. view at source ↗
Figure 4
Figure 4: Visual effects of the different types of special effects applied by our method. Based on different text inputs, our method can edit highly … view at source ↗
Figure 5
Figure 5: Distinguishing between the editing of similar subjects. Text guidances … view at source ↗
Figure 1 (supplementary)
Figure 1: Samples of our dataset. In this work, we have developed an automated script to invoke the interfaces provided by the software. Once the asset packages are imported and configured, the software can render the 30,000 facial images from the CelebA-HQ dataset [44], yielding edited results. The generation time for each special-effects category dataset is approximately three days. We are willing to fully disclose… view at source ↗
Figure 2 (supplementary)
Figure 2: Quality comparison with other video editing or generation works. view at source ↗
Original abstract

Text-conditioned image editing has greatly benefitted from the advancements in Image Diffusion Models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a unique video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. The creation of a data set is achieved through rendering techniques and the subsequent application of optimization algorithms. Remarkably, our approach does not depend on video datasets but still delivers high-quality results in both consistency and content. The excellent effect holds even for complex tasks like talking head videos and distinguishing closely related categories. The videos edited using our framework exhibit parity with videos that are made using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance: this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DiffMagicFace, a text-conditioned facial video editing framework that integrates two fine-tuned diffusion models (one for text control and one for image control) operating concurrently at inference time. To enforce identity consistency across frames, it constructs a per-subject dataset of multi-view facial images generated via rendering techniques and optimization algorithms, without any reliance on video datasets or explicit temporal modeling. The authors claim that this yields high-quality, identity-consistent results even on complex tasks such as talking-head videos and fine-grained category distinctions, achieves parity with traditional rendering software, and outperforms state-of-the-art methods in both visual quality and quantitative metrics.

Significance. If validated, the result would be significant because it suggests that static multi-view image datasets can substitute for video data in enforcing temporal identity consistency, potentially reducing data-collection costs and enabling per-subject customization for video editing. The claimed parity with traditional rendering pipelines would be a strong practical outcome if supported by rigorous evidence.

major comments (2)
  1. [Abstract] The central claim that a per-subject multi-view image dataset (created by rendering + optimization) is sufficient to enforce frame-to-frame identity consistency during inference, without video data or temporal modeling, is load-bearing for all subsequent assertions of superiority on talking-head videos and parity with traditional rendering. No analysis, ablation, or coverage argument is supplied showing that the rendered views span the non-rigid deformations, expression changes, and lighting variations present in real input videos; if the image-control signal fails to generalize, identity drift would occur and contradict the stated results.
  2. [Abstract] The manuscript asserts 'superior performance in both visual appeal and quantitative metrics' and 'parity with videos that are made using traditional rendering software,' yet supplies no specific metrics, tables, error bars, or baseline comparisons. This evidentiary gap prevents verification of the quantitative claims and makes the superiority statement impossible to assess.
minor comments (1)
  1. The abstract refers to 'comparative analysis with current state-of-the-art methods' without naming the methods or describing the evaluation protocol (e.g., datasets, metrics, or number of subjects).
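
To make the first major comment concrete: the missing coverage argument could start with something as simple as measuring how much of the input video's pose range lies near a rendered view. The sketch below is hypothetical; the pose estimator, the (yaw, pitch) parameterization, and the 10° radius are all assumptions, since the paper supplies no such analysis.

```python
import numpy as np

def pose_coverage(video_poses, rendered_poses, radius_deg=10.0):
    """Fraction of video-frame head poses within `radius_deg` of some
    rendered view. Poses are (yaw, pitch) pairs in degrees from any
    off-the-shelf head-pose estimator; the estimator, parameterization,
    and radius are all assumptions, not part of the paper.
    """
    video = np.asarray(video_poses, dtype=float)        # shape (N, 2)
    rendered = np.asarray(rendered_poses, dtype=float)  # shape (M, 2)
    # Distance from each video-frame pose to its nearest rendered view.
    d = np.linalg.norm(video[:, None, :] - rendered[None, :, :], axis=-1)
    return float((d.min(axis=1) <= radius_deg).mean())
```

A low fraction would flag the frames on which the image-control signal is most likely to fail to generalize; expression and lighting coverage would need analogous descriptors.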

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, clarifying our approach and indicating the changes we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that a per-subject multi-view image dataset (created by rendering + optimization) is sufficient to enforce frame-to-frame identity consistency during inference, without video data or temporal modeling, is load-bearing for all subsequent assertions of superiority on talking-head videos and parity with traditional rendering. No analysis, ablation, or coverage argument is supplied showing that the rendered views span the non-rigid deformations, expression changes, and lighting variations present in real input videos; if the image-control signal fails to generalize, identity drift would occur and contradict the stated results.

    Authors: We appreciate the referee's emphasis on the need for explicit justification of this core design choice. The multi-view dataset is constructed per subject via rendering and optimization to capture the subject's identity across a range of viewpoints, with the image-control diffusion model then used at inference to anchor each edited frame to this identity representation. While our results on talking-head videos indicate that this suffices in practice to prevent drift, we acknowledge that the manuscript does not currently contain a dedicated coverage analysis or ablation quantifying how well the rendered views cover non-rigid deformations, expression changes, and lighting variations. In the revised manuscript we will add a new subsection in the method or experiments detailing the optimization procedure, the distribution of rendered views, and qualitative/quantitative evidence of generalization to unseen expressions and lighting in real input videos. revision: yes

  2. Referee: [Abstract] The manuscript asserts 'superior performance in both visual appeal and quantitative metrics' and 'parity with videos that are made using traditional rendering software,' yet supplies no specific metrics, tables, error bars, or baseline comparisons. This evidentiary gap prevents verification of the quantitative claims and makes the superiority statement impossible to assess.

    Authors: We agree that the abstract would be clearer if it directly referenced the supporting quantitative evidence. The full manuscript reports comparisons against state-of-the-art methods in the Experiments section using standard metrics for identity preservation, perceptual quality, and temporal consistency, along with user studies. To address the referee's concern, we will revise the abstract to cite the specific metrics and tables, ensure error bars are shown on all quantitative plots, and add an explicit side-by-side comparison (with metrics) against outputs from traditional rendering pipelines where feasible. revision: yes
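
One metric commonly reported for temporal consistency by the diffusion video-editing literature this paper compares against (e.g., Tune-a-video [16]) is CLIP frame consistency: the mean cosine similarity of CLIP image embeddings across consecutive frames. The sketch below shows that metric as a plausible instance of what the rebuttal calls 'standard metrics', not as the paper's own protocol.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

@torch.no_grad()
def clip_frame_consistency(frames, device="cuda"):
    """Mean cosine similarity of CLIP embeddings of consecutive frames.

    `frames` is a list of PIL images from the edited clip. Higher means
    more temporally consistent; edited clips are usually judged against
    the source clip's own score.
    """
    model, preprocess = clip.load("ViT-B/32", device=device)
    feats = torch.stack([
        model.encode_image(preprocess(f).unsqueeze(0).to(device)).squeeze(0)
        for f in frames
    ]).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[:-1] * feats[1:]).sum(dim=-1).mean().item()
```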

Circularity Check

0 steps flagged

No circularity: method uses independent rendered dataset and concurrent model inference

full rationale

The paper presents DiffMagicFace as a framework that fine-tunes two control models (text and image) and creates a per-subject multi-view image dataset via rendering plus optimization to enforce identity consistency at inference without video data. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described approach. The central claim that the rendered image set suffices for video-frame consistency is an empirical assumption about generalization, not a derivation that reduces to its own inputs by construction. The absence of any mathematical chain or uniqueness theorem invoked from prior self-work keeps the score at 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract is a high-level description with no mathematical content, derivations, or explicit assumptions listed.

pith-pipeline@v0.9.0 · 5497 in / 1173 out tokens · 23012 ms · 2026-05-10T12:57:10.688289+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 14 canonical work pages · 9 internal anchors

  1. [3]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695

  2. [4]

    Stitch it in time: Gan-based facial editing of real videos,

    R. Tzaban, R. Mokady, R. Gal, A. Bermano, and D. Cohen-Or, “Stitch it in time: Gan-based facial editing of real videos,” in SIGGRAPH Asia 2022 Conference Papers, 2022, pp. 1–9

  3. [5]

    Video2stylegan: Disentangling local and global variations in a video,

    R. Abdal, P. Zhu, N. J. Mitra, and P. Wonka, “Video2stylegan: Disentangling local and global variations in a video,” arXiv preprint arXiv:2205.13996, 2022

  4. [6]

    Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2,

    I. Skorokhodov, S. Tulyakov, and M. Elhoseiny, “Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3626–3636

  5. [7]

    Exploiting spatial dimensions of latent in gan for real-time image editing,

    H. Kim, Y. Choi, J. Kim, S. Yoo, and Y. Uh, “Exploiting spatial dimensions of latent in gan for real-time image editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 852–861

  6. [8]

    Sequential attention gan for interactive image editing,

    Y. Cheng, Z. Gan, Y. Li, J. Liu, and J. Gao, “Sequential attention gan for interactive image editing,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 4383–4391

  7. [9]

    Sine: Single image editing with text-to-image diffusion models,

    Z. Zhang, L. Han, A. Ghosh, D. N. Metaxas, and J. Ren, “Sine: Single image editing with text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6027–6037

  8. [10]

    Instructpix2pix: Learning to follow image editing instructions,

    T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18392–18402

  9. [11]

    Diffusion video autoencoders: Toward temporally consistent face video editing via disentangled video encoding,

    G. Kim, H. Shim, H. Kim, Y. Choi, J. Kim, and E. Yang, “Diffusion video autoencoders: Toward temporally consistent face video editing via disentangled video encoding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6091–6100

  10. [12]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022

  11. [13]

    Pix2video: Video editing using image diffusion,

    D. Ceylan, C.-H. P. Huang, and N. J. Mitra, “Pix2video: Video editing using image diffusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 23206–23217

  12. [14]

    Dreamix: Video diffusion models are general video editors

    E. Molad, E. Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, and Y. Hoshen, “Dreamix: Video diffusion models are general video editors,” arXiv preprint arXiv:2302.01329, 2023

  13. [15]

    Video-p2p: Video editing with cross-attention control,

    S. Liu, Y. Zhang, W. Li, Z. Lin, and J. Jia, “Video-p2p: Video editing with cross-attention control,” arXiv preprint arXiv:2303.04761, 2023

  14. [16]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,

    J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7623–7633

  15. [17]

    A latent transformer for disentangled face editing in images and videos,

    X. Yao, A. Newson, Y. Gousseau, and P. Hellier, “A latent transformer for disentangled face editing in images and videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13789–13798

  16. [18]

    Imagen Video: High Definition Video Generation with Diffusion Models

    J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022

  17. [19]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni et al., “Make-a-video: Text-to-video generation without text-video data,” arXiv preprint arXiv:2209.14792, 2022

  18. [20]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020

  19. [21]

    Deep unsupervised learning using nonequilibrium thermodynamics,

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning. PMLR, 2015, pp. 2256–2265

  20. [22]

    Diffusion models beat gans on image synthesis,

    P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021

  21. [23]

    Cascaded diffusion models for high fidelity image generation,

    J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 2249–2281, 2022

  22. [24]

    Palette: Image-to-image diffusion models,

    C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10

  23. [25]

    Image super-resolution via iterative refinement,

    C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713–4726, 2022

  24. [26]

    Generative modeling by estimating gradients of the data distribution,

    Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in Neural Information Processing Systems, vol. 32, 2019

  25. [27]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

  26. [28]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

  27. [29]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36479–36494, 2022

  28. [30]

    Blended diffusion for text-driven editing of natural images,

    O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text-driven editing of natural images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18208–18218

  29. [31]

    Repaint: Inpainting using denoising diffusion probabilistic models,

    A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “Repaint: Inpainting using denoising diffusion probabilistic models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11461–11471

  30. [32]

    DiffEdit: Diffusion-based semantic image editing with mask guidance

    G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “Diffedit: Diffusion-based semantic image editing with mask guidance,” arXiv preprint arXiv:2210.11427, 2022

  31. [33]

    Chatface: Chat-guided real face editing via diffusion latent space manipulation,

    D. Yue, Q. Guo, M. Ning, J. Cui, Y. Zhu, and L. Yuan, “Chatface: Chat-guided real face editing via diffusion latent space manipulation,” arXiv preprint arXiv:2305.14742, 2023

  32. [34]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023

  33. [35]

    Diffusion autoencoders: Toward a meaningful and decodable representation,

    K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn, “Diffusion autoencoders: Toward a meaningful and decodable representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10619–10629

  34. [36]

    Stylegan-nada: Clip-guided domain adaptation of image generators,

    R. Gal, O. Patashnik, H. Maron, A. H. Bermano, G. Chechik, and D. Cohen-Or, “Stylegan-nada: Clip-guided domain adaptation of image generators,” ACM Transactions on Graphics (TOG), vol. 41, no. 4, pp. 1–13, 2022

  35. [37]

    Bigdatasetgan: Synthesizing imagenet with pixel-wise annotations,

    D. Li, H. Ling, S. W. Kim, K. Kreis, S. Fidler, and A. Torralba, “Bigdatasetgan: Synthesizing imagenet with pixel-wise annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21330–21340

  36. [38]

    Classification accuracy score for conditional generative models,

    S. Ravuri and O. Vinyals, “Classification accuracy score for conditional generative models,” Advances in Neural Information Processing Systems, vol. 32, 2019

  37. [39]

    Learning from simulated and unsupervised images through adversarial training,

    A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116

  38. [40]

    Repurposing gans for one-shot semantic part segmentation,

    N. Tritrong, P. Rewatbowornwong, and S. Suwajanakorn, “Repurposing gans for one-shot semantic part segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4475–4485

  39. [41]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900

  40. [42]

    Identity-aware and shape-aware propagation of face editing in videos,

    Y.-R. Jiang, S.-Y. Chen, H. Fu, and L. Gao, “Identity-aware and shape-aware propagation of face editing in videos,” IEEE Transactions on Visualization and Computer Graphics, 2023

  41. [43]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki, “Laion-400m: Open dataset of clip-filtered 400 million image-text pairs,” arXiv preprint arXiv:2111.02114, 2021

  42. [44]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017

  43. [45]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,

    N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510

  44. [46]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, Q. Zhang, K. Kreis, M. Aittala, T. Aila, S. Laine et al., “ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers,” arXiv preprint arXiv:2211.01324, 2022

  45. [47]

    The computation of optical flow,

    S. S. Beauchemin and J. L. Barron, “The computation of optical flow,” ACM Computing Surveys (CSUR), vol. 27, no. 3, pp. 433–466, 1995

  46. [48]

    Reliable estimation of dense optical flow fields with large displacements,

    L. Alvarez, J. Weickert, and J. Sánchez, “Reliable estimation of dense optical flow fields with large displacements,” International Journal of Computer Vision, vol. 39, pp. 41–56, 2000

  47. [49]

    A hybrid filtering approach of digital video stabilization for uav using kalman and low pass filter,

    L. Kejriwal and I. Singh, “A hybrid filtering approach of digital video stabilization for uav using kalman and low pass filter,” Procedia Computer Science, vol. 93, pp. 359–366, 2016

  48. [50]

    Beyond effects,

    Kwai, “Beyond effects,” https://effect.kuaishou.com/. SUPPLEMENTARYMATERIAL A. Our Dataset and Tools In the process of dataset creation, we employed the Be- yondEffect [50] rendering engine, a commercial software typically utilized by professional special effects artists. When employing this software, it is imperative for the creators to prefabricate asse...