Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

Bingyan Liu; Haojie Zhang; Jianhua Tao; Ruibo Fu; Xuefei Liu; Yaling Liang; Zhengqi Wen; Zhihao Liang

arxiv: 2411.16748 · v5 · submitted 2024-11-24 · 💻 cs.CV

Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, Yaling Liang

Pith reviewed 2026-05-23 17:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords talking video generation · diffusion transformer · memory bank · multimodal fusion · temporal coherence · long video synthesis · video generation efficiency · spatiotemporal modeling

The pith

A diffusion transformer with a noise-regularized memory bank generates long-duration talking videos that stay coherent and realistic while using eight times fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the breakdown in quality, consistency, and coherence that occurs when synthesizing talking videos over extended lengths. It adds a memory bank that stores prior context and regularizes it with noise to limit error buildup during generation. A compressed autoencoder and a linear-attention transformer handle the multimodal inputs efficiently. Tests of fusion strategies show that deep fusion on portrait features paired with shallow fusion on audio yields the best balance of realism, speech accuracy, and movement variety. The result is claimed to deliver higher-quality output at far lower parameter cost than earlier methods.
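The two fusion styles named above can be illustrated with a toy sketch. It follows only the one-line descriptions in the Figure 4 caption (Direct Fusion through cross-attention, Symbiotic Fusion through input concatenation and self-attention). The function names, the residual add, and the absence of learned projections are assumptions made for illustration, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width token vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def direct_fusion(video_tokens, audio_tokens):
    """Shallow fusion: video tokens query the audio condition via cross-attention."""
    attended = attend(video_tokens, audio_tokens, audio_tokens)
    # Residual add (an assumption) so the condition nudges rather than replaces.
    return [[x + y for x, y in zip(tok, att)]
            for tok, att in zip(video_tokens, attended)]


def symbiotic_fusion(video_tokens, portrait_tokens):
    """Deep fusion: concatenate portrait and video tokens, then self-attend."""
    joint = portrait_tokens + video_tokens
    mixed = attend(joint, joint, joint)
    # Keep only the positions that belong to the video tokens.
    return mixed[len(portrait_tokens):]


video = [[1.0, 0.0], [0.0, 1.0]]
portrait = [[0.5, 0.5]]
audio = [[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
fused = direct_fusion(symbiotic_fusion(video, portrait), audio)
```

In this sketch the portrait tokens share every self-attention step with the video tokens, while the audio only enters through a separate cross-attention read. That is one way to picture the "deep" versus "shallow" distinction the paragraph draws.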

Core claim

The proposed framework is a diffusion transformer that maintains contextual continuity in long-duration talking video generation through a noise-regularized memory bank, supported by a deep compression autoencoder and a spatiotemporal transformer. It combines symbiotic fusion for portrait features with direct fusion for audio, and the paper claims superior quality and efficiency with eight times fewer parameters than previous approaches.

What carries the argument

The noise-regularized memory bank, which stores contextual information from prior frames and adds noise to reduce error accumulation and sampling artifacts in long sequences.
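A minimal sketch of how such a bank could behave, under stated assumptions: the source does not give the update rule, so the fixed capacity, the choice to add Gaussian noise on read rather than on write, and the names `NoisyMemoryBank`, `generate_long_sequence`, and `denoise_clip` are all hypothetical.

```python
import random
from collections import deque


class NoisyMemoryBank:
    """Fixed-size store of past frame latents, read back with added Gaussian noise."""

    def __init__(self, capacity, noise_std, seed=0):
        self.frames = deque(maxlen=capacity)  # oldest entries fall off automatically
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def write(self, latent):
        """Store a clean copy of one frame latent."""
        self.frames.append(list(latent))

    def read(self):
        """Return noised copies of the stored latents; stored values stay clean."""
        return [[x + self.rng.gauss(0.0, self.noise_std) for x in latent]
                for latent in self.frames]


def generate_long_sequence(denoise_clip, num_clips, bank):
    """Generate clips one after another, conditioning each on the noised memory."""
    clips = []
    for index in range(num_clips):
        context = bank.read()
        clip = denoise_clip(index, context)  # hypothetical model call
        for latent in clip:
            bank.write(latent)
        clips.append(clip)
    return clips
```

The intuition this sketch tries to capture is that a model conditioned on slightly perturbed context cannot rely on exact copies of its own earlier outputs, which is one plausible reading of how noise limits error buildup over long sequences.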

Load-bearing premise

The performance gains arise primarily from the memory bank and the specific portrait-audio fusion choices rather than from training data or other unstated details.

What would settle it

Generate the same long video sequences both with and without the noise-regularized memory bank and check whether temporal artifacts, portrait drift, and error accumulation rise sharply in the version that lacks the bank.
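One way to make that comparison quantitative is to track how far each generated frame moves from the reference portrait. The sketch below assumes per-frame identity embeddings are already available from some face encoder, which is not specified in the source; `drift_curve` and `drift_slope` are hypothetical helper names.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drift_curve(reference, frame_embeddings):
    """Distance of every frame embedding from the reference portrait embedding."""
    return [cosine_distance(reference, emb) for emb in frame_embeddings]


def drift_slope(curve):
    """Least-squares slope of drift against frame index (needs 2+ frames)."""
    n = len(curve)
    mean_t = (n - 1) / 2.0
    mean_d = sum(curve) / n
    num = sum((t - mean_t) * (d - mean_d) for t, d in enumerate(curve))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den
```

Running both helpers on the with-bank and without-bank videos and comparing the slopes would give a single number per condition. A clearly larger slope without the bank would support the attribution; similar slopes would weaken it.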

Figures

Figures reproduced from arXiv: 2411.16748 by Bingyan Liu, Haojie Zhang, Jianhua Tao, Ruibo Fu, Xuefei Liu, Yaling Liang, Zhengqi Wen, Zhihao Liang.

**Figure 1.** We introduce LetsTalk, a diffusion-based transformer for audio-driven portrait animation. Given a reference image and audio, LetsTalk generates realistic videos with synchronized mouth motions. As shown in the left figure, each column corresponds to the same audio, demonstrating consistent and accurate lip movements. The right figure compares generation quality and inference time on the HDTF dataset, where…

**Figure 2.** Overview of our LetsTalk framework for robust long-duration talking head video generation. Our system combines a…

**Figure 3.** The illustration of the long-duration generation.

**Figure 4.** Multimodal fusion schemes: (a) Direct Fusion injects conditions via cross-attention modules; (b) Siamese Fusion uses parallel transformer for feature guidance; (c) Symbiotic Fusion achieves fusion through input concatenation and self-attention. The backbone architecture (left-side blocks) remains consistent across all approaches. [Accompanying table, truncated in source. Columns: Schemes, Integration, State, Params, Modality Adapt. First row: Direct, Cross-attention, Stati…]

**Figure 5.** The qualitative comparisons with other cutting-edge methods on the HDTF dataset. Our method achieves better audio…

**Figure 6.** The qualitative comparisons with the existing portrait image animation approaches on the CelebV-HQ dataset. Our…

**Figure 7.** Portrait animation results without audio guidance for…

**Figure 8.** User study results (%) on (a) realism (left) and (b) synchronization (right).

Original abstract

Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait consistency, temporal coherence, and computational efficiency. As video length increases, issues such as visual degradation, portrait drift, temporal artifacts, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal modeling, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.
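The abstract credits part of the efficiency to linear attention. A small sketch of the general idea, assuming the common kernel form with the feature map elu(x) + 1 (the abstract does not say which variant LetsTalk uses): keys and values are summarized once, so cost grows linearly with the number of tokens instead of quadratically. The function names are illustrative.

```python
import math


def feature_map(vector):
    """elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so always positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vector]


def linear_attention(queries, keys, values):
    """Kernel attention that summarizes all keys and values in a single pass."""
    dim_k, dim_v = len(keys[0]), len(values[0])
    summary = [[0.0] * dim_v for _ in range(dim_k)]  # sum of phi(k) v^T
    normalizer = [0.0] * dim_k                        # sum of phi(k)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for i in range(dim_k):
            normalizer[i] += phi_k[i]
            for j in range(dim_v):
                summary[i][j] += phi_k[i] * v[j]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        denom = sum(a * b for a, b in zip(phi_q, normalizer))
        outputs.append([sum(phi_q[i] * summary[i][j] for i in range(dim_k)) / denom
                        for j in range(dim_v)])
    return outputs


def quadratic_reference(queries, keys, values):
    """Same kernel computed pairwise in O(N^2), used only to check the result."""
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                        for j in range(len(values[0]))])
    return outputs
```

Both functions return the same numbers up to floating-point error; the difference is that `linear_attention` never forms the full query-by-key weight matrix, which matters when a video contributes many spatiotemporal tokens.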

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LetsTalk adds a noise-regularized memory bank and fusion comparison to diffusion transformers for longer talking videos, but the SOTA and efficiency claims need the actual numbers to land.

The paper's main addition is a noise-regularized memory bank inside a multimodal diffusion transformer, meant to cut error accumulation and portrait drift when generating extended talking videos. It also runs a direct comparison of fusion schemes, settling on symbiotic fusion for portrait features and direct fusion for audio. The backbone uses a deep compression autoencoder plus linear-attention spatiotemporal transformer to keep compute down.

These pieces form a coherent engineering package on top of existing diffusion transformer work. The memory bank idea is a reasonable response to the known problem of drift over long sequences, and spelling out the fusion trade-offs gives readers something concrete to test in their own setups. The efficiency angle, with the claimed 8x parameter drop, would matter if the experiments hold.

The soft spot is that the abstract asserts new state-of-the-art quality, better liveliness, and the parameter savings, yet supplies no tables, baselines, or error bars in the text provided. Without those, it is impossible to separate the contribution of the memory bank and fusion choices from dataset effects or other implementation details. The internal logic is consistent and the claims are testable rather than circular, but the central advantage still rests on the results section.

This paper is aimed at researchers working on scalable video generation, especially talking-head or avatar applications. A reader who needs practical knobs for temporal consistency in diffusion models could extract the fusion analysis and memory design. It deserves a serious referee because the problem is real, the method is described at a level that supports reproduction, and the experiments can be checked directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces LetsTalk, a multimodal diffusion transformer framework for long-duration talking video generation. It proposes a noise-regularized memory bank to alleviate error accumulation and sampling artifacts, a deep compression autoencoder, and a linear-attention spatiotemporal transformer. The authors systematically compare three multimodal fusion schemes and conclude that Symbiotic Fusion for portrait features paired with Direct Fusion for audio yields superior visual realism, speech-driven motion, and movement diversity. The work claims new state-of-the-art results in generation quality, temporal coherence, realism, diversity, and liveliness, together with an 8x reduction in parameters relative to prior approaches.

Significance. If the empirical results hold, the work would be a meaningful engineering contribution to scalable talking-head video synthesis by demonstrating practical mechanisms for long-sequence coherence and parameter efficiency. The explicit analysis of fusion strategies and the noise-regularized memory bank are concrete, testable advances that could influence subsequent multimodal diffusion designs. The claimed 8x parameter reduction, if substantiated with controlled comparisons, would be a notable strength for deployment-oriented applications.

major comments (2)

[Abstract and §4 (Experiments)] The central claim, that the noise-regularized memory bank and the Symbiotic/Direct fusion combination are the primary drivers of gains in temporal coherence, realism, and efficiency, is not supported by ablations that isolate these components from dataset choices, training schedule, or other implementation details. Without such controls, attribution of the reported SOTA performance remains uncertain.
[Abstract] The manuscript asserts 'new state-of-the-art in generation quality' and '8x fewer parameters' yet supplies no quantitative metrics, baseline comparisons, error bars, or dataset statistics in the provided text. These numbers are load-bearing for the central empirical claim and cannot be verified from the available sections.

minor comments (2)

[§3.3] The description of the three fusion schemes would benefit from an explicit diagram or pseudocode showing the exact information flow between portrait, audio, and latent features.
[§3.2] Notation for the memory bank update rule and the noise regularization term should be formalized with an equation to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential engineering contributions of LetsTalk. We respond point by point to the major comments below.

Referee: [Abstract and §4 (Experiments)] The central claim that the noise-regularized memory bank together with the Symbiotic/Direct fusion combination are the primary drivers of gains in temporal coherence, realism, and efficiency is not supported by ablations that isolate these components from dataset choices, training schedule, or other implementation details. Without such controls, attribution of the reported SOTA performance remains uncertain.

Authors: Section 4 reports controlled ablations that hold the dataset, training schedule, and other implementation details fixed while varying only the memory bank (with vs. without noise regularization) and the fusion schemes. These experiments directly attribute gains in coherence and realism to the proposed components. We agree that the manuscript text could state the fixed factors more explicitly and will revise §4 and the abstract to highlight the controlled experimental design.

Revision: partial

Referee: [Abstract] The manuscript asserts 'new state-of-the-art in generation quality' and '8x fewer parameters' yet supplies no quantitative metrics, baseline comparisons, error bars, or dataset statistics in the provided text. These numbers are load-bearing for the central empirical claim and cannot be verified from the available sections.

Authors: The abstract is a high-level summary. All load-bearing quantitative results (specific metrics, baseline tables, error bars from repeated runs, and dataset statistics) are presented in full in §4. The 8× parameter reduction is obtained from direct model-size comparisons reported in the same section. If the review copy omitted §4, we will ensure the complete manuscript is supplied; no changes to the abstract itself are required.

Revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical engineering contribution

The paper proposes an engineering framework (noise-regularized memory bank, deep compression autoencoder, linear-attention transformer, and Symbiotic/Direct fusion variants) and validates performance claims through experiments on generation quality, coherence, and efficiency. No load-bearing mathematical derivation, parameter fitting presented as prediction, or self-citation chain is present; the central claims rest on empirical results rather than reducing to their inputs by construction. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the memory bank is described as a mechanism rather than a new postulated entity with independent evidence.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 7.0

AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...

Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
cs.CV 2026-04 unverdicted novelty 6.0

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

AUHead: Realistic Emotional Talking Head Generation via Action Units Control
cs.CV 2026-02 unverdicted novelty 5.0

AUHead uses audio-language models to generate Action Unit sequences from speech and feeds them into a controllable diffusion model to synthesize realistic emotional talking-head videos.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 4 Pith papers · 8 internal anchors

[1]

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation,

W. Zhang, X. Cun, X. Wang, Y . Zhang, X. Shen, Y . Guo, Y . Shan, and F. Wang, “SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8652–8661

work page 2023
[2]

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation,

H. Wei, Z. Yang, and Z. Wang, “AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation,”arXiv preprint arXiv:2403.17694, 2024

work page arXiv 2024
[3]

Anyonenet: Synchronized speech and talking head generation for arbitrary persons,

X. Wang, Q. Xie, J. Zhu, L. Xie, and O. Scharenborg, “Anyonenet: Synchronized speech and talking head generation for arbitrary persons,” IEEE Transactions on Multimedia, vol. 25, pp. 6717–6728, 2022

work page 2022
[4]

Talkclip: Talking head generation with text-guided expressive speaking styles,

Y . Ma, S. Wang, Y . Ding, B. Ma, T. Lv, C. Fan, Z. Hu, Z. Deng, and X. Yu, “Talkclip: Talking head generation with text-guided expressive speaking styles,”IEEE Transactions on Multimedia, 2025

work page 2025
[5]

A Morphable Model For The Synthesis Of 3D Faces,

V . Blanz and T. Vetter, “A Morphable Model For The Synthesis Of 3D Faces,” inProceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’99. USA: ACM Press/Addison-Wesley Publishing Co., 1999, p. 187–194. [Online]. Available: https://doi.org/10.1145/311535.311556

work page doi:10.1145/311535.311556 1999
[6]

Learning a model of facial shape and expression from 4D scans,

T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero, “Learning a model of facial shape and expression from 4D scans,”ACM Trans. Graph., vol. 36, no. 6, pp. 194–1, 2017

work page 2017
[7]

High-Fidelity 3D Digital Human Head Creation from RGB-D Selfies,

L. Bao, X. Lin, Y . Chen, H. Zhang, S. Wang, X. Zhe, D. Kang, H. Huang, X. Jiang, J. Wanget al., “High-Fidelity 3D Digital Human Head Creation from RGB-D Selfies,”ACM Transactions on Graphics (TOG), vol. 41, no. 1, pp. 1–21, 2021

work page 2021
[8]

Hierarchical Cross- Modal Talking Face Generation With Dynamic Pixel-Wise Loss,

L. Chen, R. K. Maddox, Z. Duan, and C. Xu, “Hierarchical Cross- Modal Talking Face Generation With Dynamic Pixel-Wise Loss,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 7832–7841

work page 2019
[9]

A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild,

K. Prajwal, R. Mukhopadhyay, V . P. Namboodiri, and C. Jawahar, “A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild,” inProceedings of the 28th ACM international conference on multimedia, 2020, pp. 484–492

work page 2020
[10]

MakeItTalk: Speaker-Aware Talking-Head Animation,

Y . Zhou, X. Han, E. Shechtman, J. Echevarria, E. Kalogerakis, and D. Li, “MakeItTalk: Speaker-Aware Talking-Head Animation,”ACM Transactions On Graphics (TOG), vol. 39, no. 6, pp. 1–15, 2020

work page 2020
[11]

VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild,

K. Cheng, X. Cun, Y . Zhang, M. Xia, F. Yin, M. Zhu, X. Wang, J. Wang, and N. Wang, “VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild,” inSIGGRAPH Asia 2022 Conference Papers, 2022, pp. 1–9

work page 2022
[12]

Pose- Controllable Talking Face Generation by Implicitly Modularized Audio- Visual Representation,

H. Zhou, Y . Sun, W. Wu, C. C. Loy, X. Wang, and Z. Liu, “Pose- Controllable Talking Face Generation by Implicitly Modularized Audio- Visual Representation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4176–4186

work page 2021
[13]

Predicting personalized head movement from short video and speech signal,

R. Yi, Z. Ye, Z. Sun, J. Zhang, G. Zhang, P. Wan, H. Bao, and Y .- J. Liu, “Predicting personalized head movement from short video and speech signal,”IEEE Transactions on Multimedia, vol. 25, pp. 6315– 6328, 2022

work page 2022
[14]

Ta2v: Text-audio guided video generation,

M. Zhao, W. Wang, T. Chen, R. Zhang, and R. Li, “Ta2v: Text-audio guided video generation,”IEEE Transactions on Multimedia, vol. 26, pp. 7250–7264, 2024

work page 2024
[15]

Denoising Diffusion Probabilistic Models,

J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,”Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

work page 2020
[16]

Diffusion Models Beat Gans on Image Synthesis,

P. Dhariwal and A. Nichol, “Diffusion Models Beat Gans on Image Synthesis,”Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021

work page 2021
[17]

High-Resolution Image Synthesis With Latent Diffusion Models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis With Latent Diffusion Models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

work page 2022
[18]

Adding Conditional Control to Text-to-Image Diffusion Models,

L. Zhang, A. Rao, and M. Agrawala, “Adding Conditional Control to Text-to-Image Diffusion Models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847

work page 2023
[19]

Latte: Latent Diffusion Transformer for Video Generation

X. Ma, Y . Wang, G. Jia, X. Chen, Z. Liu, Y .-F. Li, C. Chen, and Y . Qiao, “Latte: Latent Diffusion Transformer for Video Generation,” arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

VDT: General-Purpose Video Diffusion Transformers via Mask Modeling,

H. Lu, G. Yang, N. Fei, Y . Huo, Z. Lu, P. Luo, and M. Ding, “VDT: General-Purpose Video Diffusion Transformers via Mask Modeling,” arXiv preprint arXiv:2305.13311, 2023

work page arXiv 2023
[21]

Animate Anyone: Consistent and Controllable Image-To-Video Synthesis for Character Animation,

L. Hu, “Animate Anyone: Consistent and Controllable Image-To-Video Synthesis for Character Animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8153–8163

work page 2024
[22]

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation,

M. Xu, H. Li, Q. Su, H. Shang, L. Zhang, C. Liu, J. Wang, L. Van Gool, Y . Yao, and S. Zhu, “Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation,”arXiv preprint arXiv:2406.08801, 2024

work page arXiv 2024
[23]

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Condi- tions,

Z. Chen, J. Cao, Z. Chen, Y . Li, and C. Ma, “EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Condi- tions,”arXiv preprint arXiv:2407.08136, 2024

work page arXiv 2024
[24]

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation,

J. Cui, H. Li, Y . Yao, H. Zhu, H. Shang, K. Cheng, H. Zhou, S. Zhu, and J. Wang, “Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation,”arXiv preprint arXiv:2410.07718, 2024

work page arXiv 2024
[25]

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y . Lin, Z. Zhang, M. Li, L. Zhu, Y . Luet al., “Sana: Efficient high-resolution image synthesis with linear diffusion transformers,”arXiv preprint arXiv:2410.10629, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Y . Guo, C. Yang, A. Rao, Z. Liang, Y . Wang, Y . Qiao, M. Agrawala, D. Lin, and B. Dai, “Animatediff: Animate Your Personalized Text- JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 10 To-Image Diffusion Models Without Specific Tuning,”arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2021
[27]

Video Diffusion Models,

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video Diffusion Models,”Advances in Neural Information Processing Systems, vol. 35, pp. 8633–8646, 2022

work page 2022
[28]

L.; Dai, Z.; Xu, Y.; Cao, X.; Yao, Y.; Zhu, H.; and Zhu, S

S. Zhu, J. L. Chen, Z. Dai, Q. Su, Y . Xu, X. Cao, Y . Yao, H. Zhu, and S. Zhu, “Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance,”arXiv preprint arXiv:2403.14781, 2024

work page arXiv 2024
[29]

LAION-5B: An Open Large-Scale Dataset for Training Next Gener- ation Image-Text Models,

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsmanet al., “LAION-5B: An Open Large-Scale Dataset for Training Next Gener- ation Image-Text Models,”Advances in Neural Information Processing Systems, vol. 35, pp. 25 278–25 294, 2022

work page 2022
[30]

Imagen Video: High Definition Video Generation with Diffusion Models

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleetet al., “Imagen Video: High Definition Video Generation With Diffusion Models,”arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

VideoComposer: Compositional Video Synthesis With Motion Controllability,

X. Wang, H. Yuan, S. Zhang, D. Chen, J. Wang, Y . Zhang, Y . Shen, D. Zhao, and J. Zhou, “VideoComposer: Compositional Video Synthesis With Motion Controllability,”Advances in Neural Information Process- ing Systems, vol. 36, 2024

work page 2024
[32]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

H. Chen, M. Xia, Y . He, Y . Zhang, X. Cun, S. Yang, J. Xing, Y . Liu, Q. Chen, X. Wanget al., “VideoCrafter1: Open Diffusion Models for High-Quality Video Generation,”arXiv preprint arXiv:2310.19512, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Scalable Diffusion Models With Transformers,

W. Peebles and S. Xie, “Scalable Diffusion Models With Transformers,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205

work page 2023
[34]

GeneFace: Gen- eralized and High-Fidelity Audio-Driven 3D Talking Face Synthesis,

Z. Ye, Z. Jiang, Y . Ren, J. Liu, J. He, and Z. Zhao, “GeneFace: Gen- eralized and High-Fidelity Audio-Driven 3D Talking Face Synthesis,” arXiv preprint arXiv:2301.13430, 2023

work page arXiv 2023
[35]

Diffused Heads: Diffusion Models Beat Gans on Talking- Face Generation,

M. Stypułkowski, K. V ougioukas, S. He, M. Zieba, S. Petridis, and M. Pantic, “Diffused Heads: Diffusion Models Beat Gans on Talking- Face Generation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5091–5100

work page 2024
[36]

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models,

Y . Ma, S. Zhang, J. Wang, X. Wang, Y . Zhang, and Z. Deng, “DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models,”arXiv preprint arXiv:2312.09767, 2023

work page arXiv 2023
[37]

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior,

X. Sun, L. Zhang, H. Zhu, P. Zhang, B. Zhang, X. Ji, K. Zhou, D. Gao, L. Bo, and X. Cao, “VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior,”arXiv preprint arXiv:2312.01841, 2023

work page arXiv 2023
[38]

V ASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time,

S. Xu, G. Chen, Y .-X. Guo, J. Yang, C. Li, Z. Zang, Y . Zhang, X. Tong, and B. Guo, “V ASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time,”arXiv preprint arXiv:2404.10667, 2024

work page arXiv 2024
[39]

EMO: Emote Portrait Alive-Generating Expressive Portrait Videos With audio2video Diffusion Model Under Weak Conditions,

L. Tian, Q. Wang, B. Zhang, and L. Bo, “EMO: Emote Portrait Alive-Generating Expressive Portrait Videos With audio2video Diffusion Model Under Weak Conditions,”arXiv preprint arXiv:2402.17485, 2024

work page arXiv 2024
[40]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical Text-Conditional Image Generation with CLIP Latents,”arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

Blended Diffusion for Text- driven Editing of Natural Images,

O. Avrahami, D. Lischinski, and O. Fried, “Blended Diffusion for Text- driven Editing of Natural Images,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 208–18 218

work page 2022
[42]

InstructPix2Pix: Learning to Follow Image Editing Instructions,

T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to Follow Image Editing Instructions,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 392–18 402

work page 2023
[43]

Texture- Preserving Diffusion Models for High-Fidelity Virtual Try-On,

X. Yang, C. Ding, Z. Hong, J. Huang, J. Tao, and X. Xu, “Texture- Preserving Diffusion Models for High-Fidelity Virtual Try-On,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7017–7026

work page 2024
[44]

Text2Video-Zero: Text-to-Image Diffu- sion Models are Zero-Shot Video Generators,

L. Khachatryan, A. Movsisyan, V . Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, and H. Shi, “Text2Video-Zero: Text-to-Image Diffu- sion Models are Zero-Shot Video Generators,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 954–15 964

work page 2023
[45]

Structure and Content-Guided Video Synthesis with Diffusion Models,

P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and Content-Guided Video Synthesis with Diffusion Models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356

work page 2023
[46]

Motion- Conditioned Diffusion Model for Controllable Video Synthesis,

T.-S. Chen, C. H. Lin, H.-Y . Tseng, T.-Y . Lin, and M.-H. Yang, “Motion- Conditioned Diffusion Model for Controllable Video Synthesis,”arXiv preprint arXiv:2304.14404, 2023

work page arXiv 2023
[47]

LaMD: Latent Motion Diffusion for Video Generation,

Y . Hu, Z. Chen, and C. Luo, “LaMD: Latent Motion Diffusion for Video Generation,”arXiv preprint arXiv:2304.11603, 2023

work page arXiv 2023
[48]

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation,

Z. Wang, Z. Yuan, X. Wang, Y . Li, T. Chen, M. Xia, P. Luo, and Y . Shan, “MotionCtrl: A Unified and Flexible Motion Controller for Video Generation,” inACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11

work page 2024
[49]

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation,

S. Shen, W. Zhao, Z. Meng, W. Li, Z. Zhu, J. Zhou, and J. Lu, “DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1982–1991

[50] A. Vaswani, “Attention Is All You Need,” Advances in Neural Information Processing Systems, 2017

[51] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in International conference on machine learning. PMLR, 2020, pp. 5156–5165

[52] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

[53] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu et al., “Emu3: Next-token prediction is all you need,” arXiv preprint arXiv:2409.18869, 2024

[54] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised Pre-training for Speech Recognition,” arXiv preprint arXiv:1904.05862, 2019

[55] Z. Zhang, L. Li, Y. Ding, and C. Fan, “Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3661–3670

[56] H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, Z. Liu, and C. C. Loy, “CelebV-HQ: A Large-scale Video Facial Attributes Dataset,” in European conference on computer vision. Springer, 2022, pp. 650–667

[57] T. Karras, S. Laine, and T. Aila, “A Style-Based Generator Architecture for Generative Adversarial Networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410

[58] Y. Deng, J. Yang, S. Xu, D. Chen, Y. Jia, and X. Tong, “Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019, pp. 0–0

[59] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu, “All are Worth Words: A ViT Backbone for Diffusion Models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22669–22679


