
arxiv: 2411.16748 · v5 · submitted 2024-11-24 · 💻 cs.CV

Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

Pith reviewed 2026-05-23 17:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords talking video generation · diffusion transformer · memory bank · multimodal fusion · temporal coherence · long video synthesis · video generation efficiency · spatiotemporal modeling

The pith

A diffusion transformer with a noise-regularized memory bank generates long-duration talking videos that stay coherent and realistic while using eight times fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the breakdown in quality, consistency, and coherence that occurs when synthesizing talking videos over extended lengths. It adds a memory bank that stores prior context and regularizes it with noise to limit error buildup during generation. A compressed autoencoder and a linear-attention transformer handle the multimodal inputs efficiently. Tests of fusion strategies show that deep fusion on portrait features paired with shallow fusion on audio yields the best balance of realism, speech accuracy, and movement variety. The result is claimed to deliver higher-quality output at far lower parameter cost than earlier methods.

Core claim

LetsTalk is a diffusion transformer that maintains contextual continuity in long-duration talking video generation through a noise-regularized memory bank, a deep compression autoencoder, and a spatiotemporal transformer with linear attention. By pairing Symbiotic Fusion for portrait features with Direct Fusion for audio, it is claimed to deliver higher generation quality with eight times fewer parameters than previous approaches.

What carries the argument

The noise-regularized memory bank, which stores contextual information from prior frames and adds noise to reduce error accumulation and sampling artifacts in long sequences.
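The mechanism described above can be illustrated with a minimal sketch. The page does not give the paper's actual update rule, so everything here is an assumption: the class name `NoisyMemoryBank`, the fixed capacity, the first-in-first-out eviction, and the choice to add Gaussian noise when the context is read rather than when it is written are all hypothetical.

```python
import random
from collections import deque


class NoisyMemoryBank:
    """Fixed-capacity store of past latent frames (hypothetical sketch)."""

    def __init__(self, capacity, noise_std, seed=0):
        # deque(maxlen=...) evicts the oldest frame once capacity is reached.
        self.frames = deque(maxlen=capacity)
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def update(self, latent):
        """Store a copy of a newly generated latent frame."""
        self.frames.append(list(latent))

    def read(self):
        """Return the stored frames with fresh Gaussian noise added.

        Assumption: perturbing the context keeps the generator from
        trusting its own earlier outputs too precisely, which is one
        way to limit error accumulation across segments.
        """
        return [
            [value + self.rng.gauss(0.0, self.noise_std) for value in frame]
            for frame in self.frames
        ]


bank = NoisyMemoryBank(capacity=2, noise_std=0.1)
for step in range(3):
    bank.update([float(step)] * 4)
context = bank.read()  # two noisy frames; the frame from step 0 was evicted
```

The stored frames stay clean and only the returned copies are perturbed, so the noise level could be changed between segments without corrupting the bank. Whether the paper does this is not stated on this page.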

Load-bearing premise

The performance gains arise primarily from the memory bank and the specific portrait-audio fusion choices rather than from training data or other unstated details.

What would settle it

Generate the same long video sequences both with and without the noise-regularized memory bank and check whether temporal artifacts, portrait drift, and error accumulation rise sharply in the version that lacks the bank.
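One way to quantify the portrait drift part of this comparison is sketched below. It assumes each generated frame has already been mapped to an identity embedding by some face encoder, which is not specified on this page. The function names and the head-versus-tail windowing are hypothetical choices, not a metric taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_drift(reference, frame_embeddings, window):
    """Drop in similarity to the reference portrait over the video.

    Returns the mean similarity over the first `window` frames minus
    the mean over the last `window` frames. Larger values mean the
    portrait drifted further from the reference by the end.
    """
    sims = [cosine_similarity(reference, emb) for emb in frame_embeddings]
    head = sum(sims[:window]) / window
    tail = sum(sims[-window:]) / window
    return head - tail


# Toy embeddings: a video that stays on identity vs. one that drifts away.
reference = [1.0, 0.0]
stable = [[1.0, 0.0]] * 4
drifting = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
drift_stable = identity_drift(reference, stable, window=2)      # 0.0
drift_drifting = identity_drift(reference, drifting, window=2)  # 1.0
```

Under this sketch, the ablation would compare `identity_drift` for matched sequences generated with and without the memory bank, at several video lengths.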

Figures

Figures reproduced from arXiv: 2411.16748 by Bingyan Liu, Haojie Zhang, Jianhua Tao, Ruibo Fu, Xuefei Liu, Yaling Liang, Zhengqi Wen, Zhihao Liang.

Figure 1: We introduce LetsTalk, a diffusion-based transformer for audio-driven portrait animation. Given a reference image and audio, LetsTalk generates realistic videos with synchronized mouth motions. As shown in the Left figure, each column corresponds to the same audio, demonstrating consistent and accurate lip movements. The Right figure compares generation quality and inference time on the HDTF dataset, where…
Figure 2: Overview of our LetsTalk framework for robust long-duration talking head video generation. Our system combines a…
Figure 3: The illustration of the long-duration generation.
Figure 4: Multimodal fusion schemes: (a) Direct Fusion injects conditions via cross-attention modules; (b) Siamese Fusion uses parallel transformer for feature guidance; (c) Symbiotic Fusion achieves fusion through input concatenation and self-attention. The backbone architecture (left-side blocks) remains consistent across all approaches. Accompanying table (truncated), columns: Schemes, Integration, State, Params, Modality Adapt.; first row: Direct, Cross-attention, Stati…
Figure 5: The qualitative comparisons with other cutting-edge methods on the HDTF dataset. Our method achieves better audio…
Figure 6: The qualitative comparisons with the existing portrait image animation approaches on the CelebV-HQ dataset. Our…
Figure 7: Portrait animation results without audio guidance for…
Figure 8: User study results (%) on (a) realism (left) and (b) synchronization (right).
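The Figure 4 caption distinguishes Direct Fusion (cross-attention) from Symbiotic Fusion (input concatenation followed by self-attention). The sketch below shows only that structural difference. It is a simplification under stated assumptions: there are no learned projections, residual connections, or multiple heads, and the function names are hypothetical rather than taken from the paper.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


def direct_fusion(video_tokens, condition_tokens):
    """Cross-attention: video tokens query the condition tokens only."""
    return attention(video_tokens, condition_tokens, condition_tokens)


def symbiotic_fusion(video_tokens, condition_tokens):
    """Concatenate both token sets, self-attend, keep the video positions."""
    joint = video_tokens + condition_tokens
    fused = attention(joint, joint, joint)
    return fused[:len(video_tokens)]


video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5]]     # stands in for an audio condition
portrait = [[0.5, 0.5]]  # stands in for a portrait condition
shallow = direct_fusion(video, audio)       # [[0.5, 0.5], [0.5, 0.5]]
deep = symbiotic_fusion(video, portrait)    # same shape as `video`
```

In the concatenated form the condition tokens also attend to the video tokens inside the same layers, so the two streams can adapt to each other layer by layer. In the cross-attention form the condition tokens only ever serve as keys and values.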
Original abstract

Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait consistency, temporal coherence, and computational efficiency. As video length increases, issues such as visual degradation, portrait drift, temporal artifacts, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal modeling, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.
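The abstract mentions a transformer with linear attention but gives no formula. The sketch below shows the standard kernelized form of linear attention, in which the key-value summary is computed once and reused for every query, so cost grows linearly with the number of tokens. The `elu(x) + 1` feature map is a common choice and is an assumption here, as this page does not say which variant LetsTalk uses.

```python
import math


def feature_map(x):
    """elu(x) + 1, which is strictly positive for every input."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized attention: out_i = phi(q_i)^T S / (phi(q_i)^T z).

    S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) are accumulated in
    a single pass over the keys, independent of the number of queries.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv_sum = [[0.0] * d_v for _ in range(d_k)]
    k_sum = [0.0] * d_k
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv_sum[a][b] += fk[a] * v[b]

    outputs = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))
        outputs.append([
            sum(fq[a] * kv_sum[a][b] for a in range(d_k)) / norm
            for b in range(d_v)
        ])
    return outputs


queries = [[1.0, -1.0], [0.2, 0.3]]
keys = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = linear_attention(queries, keys, values)  # one output row per query
```

Each output row is a weighted average of the value vectors with positive weights, as in softmax attention, but the weights come from `phi(q) · phi(k)` instead of an exponential of the dot product.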

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LetsTalk, a multimodal diffusion transformer framework for long-duration talking video generation. It proposes a noise-regularized memory bank to alleviate error accumulation and sampling artifacts, a deep compression autoencoder, and a linear-attention spatiotemporal transformer. The authors systematically compare three multimodal fusion schemes and conclude that Symbiotic Fusion for portrait features paired with Direct Fusion for audio yields superior visual realism, speech-driven motion, and movement diversity. The work claims new state-of-the-art results in generation quality, temporal coherence, realism, diversity, and liveliness, together with an 8x reduction in parameters relative to prior approaches.

Significance. If the empirical results hold, the work would be a meaningful engineering contribution to scalable talking-head video synthesis by demonstrating practical mechanisms for long-sequence coherence and parameter efficiency. The explicit analysis of fusion strategies and the noise-regularized memory bank are concrete, testable advances that could influence subsequent multimodal diffusion designs. The claimed 8x parameter reduction, if substantiated with controlled comparisons, would be a notable strength for deployment-oriented applications.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim that the noise-regularized memory bank and the Symbiotic/Direct fusion combination are the primary drivers of gains in temporal coherence, realism, and efficiency is not supported by ablations that isolate these components from dataset choices, training schedule, or other implementation details. Without such controls, attribution of the reported SOTA performance remains uncertain.
  2. [Abstract] The manuscript asserts 'new state-of-the-art in generation quality' and '8x fewer parameters' yet supplies no quantitative metrics, baseline comparisons, error bars, or dataset statistics in the provided text. These numbers are load-bearing for the central empirical claim and cannot be verified from the available sections.
minor comments (2)
  1. [§3.3] The description of the three fusion schemes would benefit from an explicit diagram or pseudocode showing the exact information flow between portrait, audio, and latent features.
  2. [§3.2] Notation for the memory bank update rule and the noise regularization term should be formalized with an equation to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential engineering contributions of LetsTalk. We respond point by point to the major comments below.

  1. Referee: [Abstract and §4 (Experiments)] The central claim that the noise-regularized memory bank and the Symbiotic/Direct fusion combination are the primary drivers of gains in temporal coherence, realism, and efficiency is not supported by ablations that isolate these components from dataset choices, training schedule, or other implementation details. Without such controls, attribution of the reported SOTA performance remains uncertain.

    Authors: Section 4 reports controlled ablations that hold the dataset, training schedule, and other implementation details fixed while varying only the memory bank (with vs. without noise regularization) and the fusion schemes. These experiments directly attribute gains in coherence and realism to the proposed components. We agree that the manuscript text could state the fixed factors more explicitly and will revise §4 and the abstract to highlight the controlled experimental design.

    revision: partial

  2. Referee: [Abstract] The manuscript asserts 'new state-of-the-art in generation quality' and '8x fewer parameters' yet supplies no quantitative metrics, baseline comparisons, error bars, or dataset statistics in the provided text. These numbers are load-bearing for the central empirical claim and cannot be verified from the available sections.

    Authors: The abstract is a high-level summary. All load-bearing quantitative results (specific metrics, baseline tables, error bars from repeated runs, and dataset statistics) are presented in full in §4. The 8× parameter reduction is obtained from direct model-size comparisons reported in the same section. If the review copy omitted §4, we will ensure the complete manuscript is supplied; no changes to the abstract itself are required.

    revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical engineering contribution


The paper proposes an engineering framework (noise-regularized memory bank, deep compression autoencoder, linear-attention transformer, and Symbiotic/Direct fusion variants) and validates performance claims through experiments on generation quality, coherence, and efficiency. No load-bearing mathematical derivation, parameter fitting presented as prediction, or self-citation chain is present. The central claims rest on empirical results rather than reducing to their inputs by construction, and they are checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the memory bank is described as a mechanism rather than a new postulated entity with independent evidence.

pith-pipeline@v0.9.0 · 5770 in / 1100 out tokens · 33649 ms · 2026-05-23T17:15:20.320312+00:00 · methodology


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...

  2. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  3. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.

  4. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

  5. SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

  6. AUHead: Realistic Emotional Talking Head Generation via Action Units Control

    cs.CV 2026-02 unverdicted novelty 5.0

    AUHead uses audio-language models to generate Action Unit sequences from speech and feeds them into a controllable diffusion model to synthesize realistic emotional talking-head videos.
