arxiv: 2506.23552 · v2 · pith:PUBU7QBYnew · submitted 2025-06-30 · 💻 cs.CV · cs.SD · eess.AS

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Pith reviewed 2026-05-22 00:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.SD · eess.AS
keywords joint audio-motion synthesis · flow matching · multi-modal diffusion transformer · talking head generation · audio-driven animation · cross-modal attention · inpainting objective · unified generative model

The pith

A unified flow-matching model with coupled audio and motion transformers jointly synthesizes speech and facial animation from text, audio, or motion inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a single framework called JAM-Flow that generates facial motion and speech together instead of treating them as separate tasks. It builds this on flow matching combined with a Multi-Modal Diffusion Transformer that runs specialized Motion-DiT and Audio-DiT modules linked by selective joint attention. The model trains on an inpainting objective so it can accept flexible conditioning such as text prompts, reference audio clips, or reference motion sequences. A sympathetic reader would care because the intrinsic coupling between voice and face could then be captured inside one coherent network rather than stitched together from independent systems.

Core claim

JAM-Flow is a unified framework that leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture to simultaneously synthesize and condition on both facial motion and speech. Specialized Motion-DiT and Audio-DiT modules are coupled via selective joint attention layers that use temporally aligned positional embeddings and localized joint attention masking. Trained with an inpainting-style objective, the model supports conditioning on text, reference audio, and reference motion to perform synchronized talking-head generation from text, audio-driven animation, and additional tasks inside one model.
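
The training recipe named here (flow matching plus an inpainting-style objective) can be illustrated with a minimal sketch. It assumes the common linear interpolation path x_t = (1 - t)·x0 + t·x1 with target velocity x1 - x0, and it assumes the loss is computed only on the masked (to-be-generated) positions. This summary states neither detail, and the function names `make_training_example` and `masked_mse` are hypothetical.

```python
import random

def make_training_example(x1, keep_mask, t, rng):
    """Build one flow-matching training example with inpainting-style context.

    x1        : clean target sequence (list of floats), e.g. one feature channel
    keep_mask : keep_mask[i] is True where the clean value is given as context
    t         : flow time in [0, 1]
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # Gaussian noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # assumed linear path
    target_v = [b - a for a, b in zip(x0, x1)]             # velocity of that path
    # Context carries clean values at kept positions and zeros elsewhere.
    context = [b if m else 0.0 for b, m in zip(x1, keep_mask)]
    return x_t, context, target_v

def masked_mse(pred_v, target_v, keep_mask):
    """Mean squared error over positions that were NOT given as context."""
    errs = [(p - q) ** 2 for p, q, m in zip(pred_v, target_v, keep_mask) if not m]
    return sum(errs) / len(errs) if errs else 0.0

rng = random.Random(0)
clean = [0.5, -1.0, 2.0, 0.0]
keep = [True, True, False, False]   # first half acts as the reference prompt
x_t, context, target_v = make_training_example(clean, keep, t=0.3, rng=rng)
```

Under these assumptions, a reference clip would occupy the kept positions at inference time while the remaining positions are generated, which is how one set of weights could serve several conditioning setups.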

What carries the argument

Multi-Modal Diffusion Transformer (MM-DiT) whose Motion-DiT and Audio-DiT modules are coupled through selective joint attention layers with localized masking and aligned positional embeddings.
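
One way to picture localized joint attention masking is as a boolean matrix that lets a motion token attend only to audio tokens close to it in time. The sketch below is an illustration under stated assumptions: the token rates (25 motion tokens and 50 audio tokens per second) and the 0.05 s window are hypothetical values, and `localized_joint_mask` is not a function from the paper.

```python
def localized_joint_mask(n_motion, n_audio, motion_rate, audio_rate, window_s):
    """Return mask[i][j] == True when motion token i may attend to audio token j.

    Tokens are placed on a shared time axis (index / rate, in seconds) and a
    pair is allowed only if the two timestamps differ by at most window_s.
    """
    return [
        [abs(i / motion_rate - j / audio_rate) <= window_s for j in range(n_audio)]
        for i in range(n_motion)
    ]

# Hypothetical rates: 25 motion tokens/s, 50 audio tokens/s, 0.05 s window.
mask = localized_joint_mask(n_motion=4, n_audio=8,
                            motion_rate=25, audio_rate=50, window_s=0.05)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is a band around the diagonal of the time-aligned grid: each motion token sees only a short neighborhood of audio tokens instead of the whole audio sequence.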

If this is right

  • Text prompts alone can drive synchronized talking-head video output.
  • Reference audio can animate a source face without separate lip-sync modules.
  • Reference motion can condition speech generation in the reverse direction.
  • An inpainting objective lets the same weights handle missing modalities at inference time.
  • Multiple audio-visual tasks run inside one coherent trained network rather than separate models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-attention pattern could be tested on full-body motion paired with audio or on longer video clips with background sound.
  • Training on mixed conditioning might reduce the need for task-specific fine-tuning in animation or virtual-agent pipelines.
  • If the localized masking proves robust, similar selective coupling could be applied to other paired modalities such as gesture and text.
  • The flow-matching backbone may allow faster sampling than diffusion baselines when generating both streams together.
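
The sampling-speed point rests on how flow-matching models generate: by integrating a velocity field from noise at t = 0 to data at t = 1, which can sometimes be done with few solver steps. A minimal Euler-integration sketch follows. The velocity function is a hand-written stand-in that pulls the state toward a fixed target, not a trained network, and the step count is arbitrary.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt                      # t stays strictly below 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target = [1.0, -2.0, 0.5]

def toy_velocity(x, t):
    # Stand-in for a trained model: straight-line velocity toward `target`.
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]

sample = euler_sample(toy_velocity, x0=[0.0, 0.0, 0.0], num_steps=4)
```

Because this toy field is exactly straight, four steps already land on the target; a learned field is only approximately straight, so the step count needed in practice is an empirical question.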

Load-bearing premise

The selective joint attention layers and localized masking enable effective cross-modal interaction while still preserving each modality's independent strengths.

What would settle it

If a model trained on the same data but without the joint attention layers produces audio-motion pairs that are just as synchronized and just as high in quality on standard benchmarks, the claimed benefit of the coupled architecture would be refuted; a measurable drop in synchronization or quality would instead support it.

Figures

Figures reproduced from arXiv: 2506.23552 by Jaeseok Jung, Jaesik Park, Joonghyuk Shin, Mingi Kwon, Youngjung Uh.

Figure 1: Overview of our JAM-Flow framework for flexible and joint generation of facial motion ...
Figure 2: LivePortrait framework and mouth-related expression keypoint analysis. LivePortrait's ...
Figure 3: The training and inference pipeline of the JAM-Flow framework. Our joint MM-DiT com...
Original abstract

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs (including text, reference audio, and reference motion), facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces JAM-Flow, a unified framework for simultaneous synthesis and conditioning on facial motion and speech. It employs flow matching together with a novel Multi-Modal Diffusion Transformer (MM-DiT) that couples specialized Motion-DiT and Audio-DiT modules via selective joint attention layers, temporally aligned positional embeddings, and localized joint attention masking. The model is trained with an inpainting-style objective and supports conditioning on text, reference audio, and reference motion to enable tasks such as text-conditioned talking-head generation and audio-driven animation within a single coherent model.

Significance. If the claimed cross-modal benefits materialize, JAM-Flow would constitute a practical advance in multi-modal generative modeling by replacing separate talking-head and TTS pipelines with a single flow-matching model that handles a wide range of conditioning inputs. The architectural emphasis on preserving modality-specific strengths while enabling interaction is a potentially useful design pattern for other audio-visual tasks.

major comments (1)
  1. [MM-DiT architecture description] The central claim that the selective joint attention layers and localized joint attention masking in the MM-DiT produce effective cross-modal interaction while preserving modality-specific strengths is load-bearing for the paper's contribution. No ablation results are presented that isolate these components against independent Motion-DiT/Audio-DiT modules or simpler fusion baselines; therefore it remains unclear whether any observed gains on joint tasks arise from the proposed attention mechanism or simply from the shared flow-matching objective and inpainting loss.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one quantitative result or baseline comparison to support the claim of a 'significant advance.'

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of isolating the contributions of the MM-DiT components. We address the major comment below and describe the planned revisions.

Point-by-point responses
  1. Referee: [MM-DiT architecture description] The central claim that the selective joint attention layers and localized joint attention masking in the MM-DiT produce effective cross-modal interaction while preserving modality-specific strengths is load-bearing for the paper's contribution. No ablation results are presented that isolate these components against independent Motion-DiT/Audio-DiT modules or simpler fusion baselines; therefore it remains unclear whether any observed gains on joint tasks arise from the proposed attention mechanism or simply from the shared flow-matching objective and inpainting loss.

    Authors: We agree that the absence of targeted ablations leaves the specific contribution of the selective joint attention layers and localized joint attention masking insufficiently isolated. The current results demonstrate end-to-end performance but do not directly compare against independent Motion-DiT and Audio-DiT modules or simpler fusion baselines such as feature concatenation or standard cross-attention. In the revised manuscript we will add these ablation experiments on the same training and evaluation splits, reporting both quantitative metrics (e.g., synchronization error, perceptual quality) and qualitative visualizations that highlight the effect of the proposed masking and joint-attention design choices. revision: yes

Circularity Check

0 steps flagged

No circularity: model architecture and training described without self-referential derivations

full rationale

The paper proposes JAM-Flow as a new unified framework combining flow matching with a custom MM-DiT architecture that couples Motion-DiT and Audio-DiT via selective joint attention, temporally aligned embeddings, and localized masking, trained under an inpainting-style objective. No equations, predictions, or first-principles results are shown that reduce by construction to fitted inputs, self-citations, or renamed known results. All components are presented as design choices justified by their intended cross-modal behavior on external data, leaving the central claims independent of any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the newly introduced MM-DiT coupling mechanism and the inpainting training objective; these are presented without independent prior validation or external benchmarks in the available text.

free parameters (1)
  • localized joint attention masking parameters
    Key architectural choice for cross-modal interaction whose specific values are not detailed in the abstract.
axioms (1)
  • domain assumption: Temporally aligned positional embeddings maintain synchronization between audio and motion sequences
    Invoked to support effective cross-modal interaction in the MM-DiT.
invented entities (1)
  • Multi-Modal Diffusion Transformer (MM-DiT) · no independent evidence
    purpose: To couple specialized Motion-DiT and Audio-DiT modules via selective joint attention
    New architecture component introduced to enable joint synthesis.
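
The temporal-alignment axiom in this ledger can be made concrete with a small sketch. It assumes the two streams have different token rates and that positions are rescaled onto a shared time axis before a rotary embedding is applied. The rates (25 and 100 tokens per second), the single rotation frequency, and the helper names are illustrative assumptions rather than the paper's configuration.

```python
import math

def scaled_positions(n_tokens, tokens_per_second, ref_rate):
    """Map token indices to positions on a shared axis of `ref_rate` ticks/s."""
    return [i * ref_rate / tokens_per_second for i in range(n_tokens)]

def rope_2d(vec, pos, theta=0.1):
    """Rotate a 2-D vector by angle pos * theta (one rotary frequency pair)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

motion_pos = scaled_positions(3, tokens_per_second=25, ref_rate=100)   # [0.0, 4.0, 8.0]
audio_pos = scaled_positions(9, tokens_per_second=100, ref_rate=100)   # 0.0 .. 8.0

q, k = (1.0, 2.0), (0.5, -1.0)
# Motion token 2 and audio token 8 both sit at 0.08 s, so they share a position
# and the two rotations cancel in the query-key dot product.
aligned = dot(rope_2d(q, motion_pos[2]), rope_2d(k, audio_pos[8]))
# Using raw indices instead (2 vs 8) leaves a spurious relative rotation.
unaligned = dot(rope_2d(q, 2), rope_2d(k, 8))
```

With rescaled positions, the attention score between simultaneous tokens depends only on their content; with raw indices, the same pair looks six positions apart even though the tokens coincide in time.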

pith-pipeline@v0.9.0 · 5739 in / 1364 out tokens · 66063 ms · 2026-05-22T00:42:32.119358+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction and embed_strictMono unclear

    Relation between the paper passage and the cited Recognition theorem.

    we introduce N_joint layers of joint attention between the audio and motion streams... we apply scaled rotary positional embeddings (RoPE)... attention masking strategy that respects the temporal dynamics

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Joint attention procedure

Pseudocode from the paper for joint attention between two streams with inputs x1 and x2 of lengths L1 and L2; the flags α1 and α2 switch joint attention on for each stream.

  1. Project input to QKV:
       (q1, k1, v1) ← attn1.TOQKV(x1)
       (q2, k2, v2) ← attn2.TOQKV(x2)

  2. Apply rotary embeddings (if provided):
       if rope1 exists then (q1, k1) ← APPLYROPE(q1, k1, rope1)
       if rope2 exists then (q2, k2) ← APPLYROPE(q2, k2, rope2)

  3. Construct joint token pools:
       if α1 = 1 then q⋆1 ← [q1; q2], k⋆1 ← [k1; k2], v⋆1 ← [v1; v2]
       else (q⋆1, k⋆1, v⋆1) ← (q1, k1, v1)
       if α2 = 1 then q⋆2 ← [q2; q1], k⋆2 ← [k2; k1], v⋆2 ← [v2; v1]
       else (q⋆2, k⋆2, v⋆2) ← (q2, k2, v2)

  4. Split heads and apply masks:
       (q⋆1, k⋆1, v⋆1) ← SPLITHEADS(q⋆1, k⋆1, v⋆1)
       (q⋆2, k⋆2, v⋆2) ← SPLITHEADS(q⋆2, k⋆2, v⋆2)
       if mask1 ≠ ∅ ∧ α1 = 1 then M1 ← CUSTOMDIAGMASK(L1, L2, mask1) else M1 ← ∅
       if mask2 ≠ ∅ ∧ α2 = 1 then M2 ← CUSTOMDIAGMASK(L2, L1, mask2) else M2 ← ∅

  5. Compute scaled dot-product attention:
       o⋆1 ← SDPA(q⋆1, k⋆1, v⋆1, M1)
       o⋆2 ← SDPA(q⋆2, k⋆2, v⋆2, M2)

  6. Merge heads, trim to original length, and project:
       o1 ← MERGEHEADS(o⋆1)[:, :L1], o1 ← attn1.OUTPROJ(o1)
       o2 ← MERGEHEADS(o⋆2)[:, :L2], o2 ← attn2.OUTPROJ(o2)
       return (o1, o2)
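
The joint-attention steps listed above (per-stream QKV, concatenated token pools gated by α, masked scaled dot-product attention, and trimming back to each stream's own length) can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it uses identity projections in place of the learned QKV and output projections, a single head, no rotary embeddings, and it takes a ready-made boolean mask instead of building one, since this page does not specify how the paper's mask helper behaves.

```python
import math

def sdpa(q, k, v, allowed=None):
    """Scaled dot-product attention on lists of vectors.

    allowed[i][j] == False blocks query i from key j; every query must keep
    at least one allowed key.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) * scale
            if allowed is None or allowed[i][j] else float("-inf")
            for j, kj in enumerate(k)
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]   # blocked keys get 0.0
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out

def joint_attention(x1, x2, alpha1=True, alpha2=True, mask1=None, mask2=None):
    """Two-stream attention; alpha_n=True lets stream n see the other stream."""
    len1, len2 = len(x1), len(x2)
    # Identity projections stand in for the learned per-stream QKV layers.
    q1 = k1 = v1 = x1
    q2 = k2 = v2 = x2
    # Joint token pools: own tokens first, so outputs can be trimmed by prefix.
    pool1 = (q1 + q2, k1 + k2, v1 + v2) if alpha1 else (q1, k1, v1)
    pool2 = (q2 + q1, k2 + k1, v2 + v1) if alpha2 else (q2, k2, v2)
    o1 = sdpa(*pool1, allowed=mask1 if alpha1 else None)[:len1]
    o2 = sdpa(*pool2, allowed=mask2 if alpha2 else None)[:len2]
    return o1, o2

motion = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
o_motion, o_audio = joint_attention(motion, audio)
```

Placing each stream's own tokens first in its pool is what makes the final trim a simple prefix slice. Setting a flag to false reduces that stream to ordinary self-attention, which gives a direct way to compare coupled and uncoupled behavior in a toy setting.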