Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation
Pith reviewed 2026-05-23 17:15 UTC · model grok-4.3
The pith
A diffusion transformer with a noise-regularized memory bank generates long-duration talking videos that stay coherent and realistic while using eight times fewer parameters than previous approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed framework, LetsTalk, is a diffusion transformer that maintains contextual continuity in long-duration talking video generation through a noise-regularized memory bank, supported by a deep compression autoencoder and a spatiotemporal transformer with linear attention. It pairs Symbiotic Fusion for portrait features with Direct Fusion for audio, and the paper claims superior quality and efficiency with eight times fewer parameters than previous approaches.
What carries the argument
The noise-regularized memory bank, which stores contextual information from prior frames and adds noise to reduce error accumulation and sampling artifacts in long sequences.
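The mechanism described above can be illustrated with a minimal sketch. The page gives no implementation details, so everything here is an assumption made for illustration: the class name `NoisyMemoryBank`, the fixed-size FIFO layout, and the choice to add Gaussian noise at read time are hypothetical and may differ from what the paper actually does.

```python
import random
from collections import deque


class NoisyMemoryBank:
    """Hypothetical fixed-size store of latent vectors from earlier chunks."""

    def __init__(self, capacity, noise_std, seed=0):
        self.slots = deque(maxlen=capacity)  # oldest entries fall off automatically
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def write(self, latent):
        # Store a copy of the clean latent for a finished frame or chunk.
        self.slots.append(list(latent))

    def read(self):
        # Return every stored latent perturbed with Gaussian noise, so the
        # generator never conditions on an exact copy of its own past output.
        return [
            [x + self.rng.gauss(0.0, self.noise_std) for x in latent]
            for latent in self.slots
        ]


bank = NoisyMemoryBank(capacity=2, noise_std=0.1)
for chunk_latent in ([0.0, 0.0], [1.0, 1.0], [2.0, 2.0]):
    bank.write(chunk_latent)

context = bank.read()  # two noisy latents, derived from the last two writes
```

Perturbing the context at read time is one plausible way to keep a model from over-trusting its own earlier output, which is the error-accumulation concern the summary raises. Injecting noise at write time, or only during training, would be equally consistent with the description given here.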
Load-bearing premise
The performance gains arise primarily from the memory bank and the specific portrait-audio fusion choices rather than from training data or other unstated details.
What would settle it
Generate the same long video sequences both with and without the noise-regularized memory bank and check whether temporal artifacts, portrait drift, and error accumulation rise sharply in the version that lacks the bank.
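To make the comparison above measurable, portrait drift could be scored per frame. The sketch below is one possible metric, not one taken from the paper: it assumes each generated frame has already been mapped to an identity embedding by some face-recognition model (not shown), and the function names `drift_curve` and `drift_growth` are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drift_curve(reference, frame_embeddings):
    """Distance of each frame's identity embedding from the reference portrait."""
    return [cosine_distance(reference, e) for e in frame_embeddings]


def drift_growth(curve):
    """Mean drift in the second half of the video minus mean drift in the first half.

    Expects at least two frames.
    """
    half = len(curve) // 2
    first, second = curve[:half], curve[half:]
    return sum(second) / len(second) - sum(first) / len(first)


# Toy embeddings only: a stable run and a run that rotates away from the reference.
reference = [1.0, 0.0]
stable = [[1.0, 0.0]] * 4
drifting = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.5], [1.0, 1.0]]

stable_growth = drift_growth(drift_curve(reference, stable))
drifting_growth = drift_growth(drift_curve(reference, drifting))
print(stable_growth, drifting_growth)
```

Under this metric, the test described above would predict a clearly larger `drift_growth` for videos generated without the memory bank than for videos generated with it, on the same audio and reference portrait.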
Original abstract
Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait consistency, temporal coherence, and computational efficiency. As video length increases, issues such as visual degradation, portrait drift, temporal artifacts, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal modeling, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.
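The abstract mentions linear attention without giving a formulation. As a point of reference, the sketch below shows the widely used kernel feature-map form of non-causal linear attention with the feature map elu(x) + 1. Whether LetsTalk uses this exact variant is not stated on this page, so treat the feature map and the pure-Python layout as assumptions.

```python
import math


def phi(vec):
    # Feature map elu(x) + 1, which keeps every feature strictly positive.
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def linear_attention(queries, keys, values):
    """Non-causal linear attention; cost grows linearly with the number of tokens."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum over tokens of phi(k) v^T
    k_sum = [0.0] * d_k                     # sum over tokens of phi(k)
    for k, v in zip(keys, values):
        fk = phi(k)
        for i in range(d_k):
            k_sum[i] += fk[i]
            for j in range(d_v):
                kv[i][j] += fk[i] * v[j]

    outputs = []
    for q in queries:
        fq = phi(q)
        denom = sum(fq[i] * k_sum[i] for i in range(d_k))
        outputs.append(
            [sum(fq[i] * kv[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs


tokens = [[0.3, -0.2], [1.0, 0.5], [-0.4, 0.8]]
attended = linear_attention(tokens, tokens, tokens)  # one output row per token
```

The key point is that the two running sums `kv` and `k_sum` are built once and reused for every query, so the pairwise token-by-token score matrix of softmax attention is never formed. That is what makes this family of attention attractive for long video sequences.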
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LetsTalk, a multimodal diffusion transformer framework for long-duration talking video generation. It proposes a noise-regularized memory bank to alleviate error accumulation and sampling artifacts, a deep compression autoencoder, and a linear-attention spatiotemporal transformer. The authors systematically compare three multimodal fusion schemes and conclude that Symbiotic Fusion for portrait features paired with Direct Fusion for audio yields superior visual realism, speech-driven motion, and movement diversity. The work claims new state-of-the-art results in generation quality, temporal coherence, realism, diversity, and liveliness, together with an 8x reduction in parameters relative to prior approaches.
Significance. If the empirical results hold, the work would be a meaningful engineering contribution to scalable talking-head video synthesis by demonstrating practical mechanisms for long-sequence coherence and parameter efficiency. The explicit analysis of fusion strategies and the noise-regularized memory bank are concrete, testable advances that could influence subsequent multimodal diffusion designs. The claimed 8x parameter reduction, if substantiated with controlled comparisons, would be a notable strength for deployment-oriented applications.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The central claim that the noise-regularized memory bank together with the Symbiotic/Direct fusion combination are the primary drivers of gains in temporal coherence, realism, and efficiency is not supported by ablations that isolate these components from dataset choices, training schedule, or other implementation details. Without such controls, attribution of the reported SOTA performance remains uncertain.
- [Abstract] Abstract: The manuscript asserts 'new state-of-the-art in generation quality' and '8x fewer parameters' yet supplies no quantitative metrics, baseline comparisons, error bars, or dataset statistics in the provided text. These numbers are load-bearing for the central empirical claim and cannot be verified from the available sections.
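The kind of controlled ablation the first major comment asks for can be sketched as a configuration grid in which only the components under study vary. All names and values below (`FIXED`, `FACTORS`, the dataset label, the step count) are hypothetical placeholders, not settings from the paper. The abstract mentions three fusion schemes but names only two here, so a third level would need to be added to the fusion factors.

```python
from itertools import product

# Hypothetical settings held constant across every run.
FIXED = {"dataset": "same_train_split", "steps": 100_000, "seed": 0}

# Hypothetical factors under study; only these change between runs.
FACTORS = {
    "memory_bank": ["none", "plain", "noise_regularized"],
    "portrait_fusion": ["direct", "symbiotic"],
    "audio_fusion": ["direct", "symbiotic"],
}


def ablation_grid(fixed, factors):
    """Yield one config per combination of factor levels, sharing all fixed settings."""
    names = list(factors)
    for levels in product(*(factors[n] for n in names)):
        yield {**fixed, **dict(zip(names, levels))}


configs = list(ablation_grid(FIXED, FACTORS))
print(len(configs))  # 12
```

Because every config shares the same dataset, schedule, and seed, any difference in the resulting metrics can be attributed to the memory bank or fusion choice rather than to training details.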
minor comments (2)
- [§3.3] The description of the three fusion schemes would benefit from an explicit diagram or pseudocode showing the exact information flow between portrait, audio, and latent features.
- [§3.2] Notation for the memory bank update rule and the noise regularization term should be formalized with an equation to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential engineering contributions of LetsTalk. We respond point by point to the major comments below.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that the noise-regularized memory bank together with the Symbiotic/Direct fusion combination are the primary drivers of gains in temporal coherence, realism, and efficiency is not supported by ablations that isolate these components from dataset choices, training schedule, or other implementation details. Without such controls, attribution of the reported SOTA performance remains uncertain.
Authors: Section 4 reports controlled ablations that hold the dataset, training schedule, and other implementation details fixed while varying only the memory bank (with vs. without noise regularization) and the fusion schemes. These experiments directly attribute gains in coherence and realism to the proposed components. We agree that the manuscript text could state the fixed factors more explicitly and will revise §4 and the abstract to highlight the controlled experimental design.
Revision: partial
- Referee: [Abstract] Abstract: The manuscript asserts 'new state-of-the-art in generation quality' and '8x fewer parameters' yet supplies no quantitative metrics, baseline comparisons, error bars, or dataset statistics in the provided text. These numbers are load-bearing for the central empirical claim and cannot be verified from the available sections.
Authors: The abstract is a high-level summary. All load-bearing quantitative results (specific metrics, baseline tables, error bars from repeated runs, and dataset statistics) are presented in full in §4. The 8× parameter reduction is obtained from direct model-size comparisons reported in the same section. If the review copy omitted §4, we will ensure the complete manuscript is supplied; no changes to the abstract itself are required.
Revision: no
Circularity Check
No significant circularity; empirical engineering contribution
The paper proposes an engineering framework (noise-regularized memory bank, deep compression autoencoder, linear-attention transformer, and Symbiotic/Direct fusion variants) and validates its performance claims through experiments on generation quality, coherence, and efficiency. It contains no load-bearing mathematical derivation, no parameter fitting presented as prediction, and no self-citation chain; the central claims rest on empirical results rather than reducing to their inputs by construction. The evaluation is self-contained and measured against external benchmarks.
Forward citations
Cited by 6 Pith papers
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
- Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
- SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
- AUHead: Realistic Emotional Talking Head Generation via Action Units Control
AUHead uses audio-language models to generate Action Unit sequences from speech and feeds them into a controllable diffusion model to synthesize realistic emotional talking-head videos.