JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Jaeseok Jung; Jaesik Park; Joonghyuk Shin; Mingi Kwon; Youngjung Uh

arXiv: 2506.23552 · v2 · pith:PUBU7QBYnew · submitted 2025-06-30 · cs.CV · cs.SD · eess.AS

Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

Pith reviewed 2026-05-22 00:42 UTC · model grok-4.3

keywords: joint audio-motion synthesis · flow matching · multi-modal diffusion transformer · talking head generation · audio-driven animation · cross-modal attention · inpainting objective · unified generative model

The pith

A unified flow-matching model with coupled audio and motion transformers jointly synthesizes speech and facial animation from text, audio, or motion inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a single framework called JAM-Flow that generates facial motion and speech together instead of treating them as separate tasks. It builds this on flow matching combined with a Multi-Modal Diffusion Transformer that runs specialized Motion-DiT and Audio-DiT modules linked by selective joint attention. The model trains on an inpainting objective so it can accept flexible conditioning such as text prompts, reference audio clips, or reference motion sequences. A sympathetic reader would care because the intrinsic coupling between voice and face could then be captured inside one coherent network rather than stitched together from independent systems.

Core claim

JAM-Flow is a unified framework that leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture to simultaneously synthesize and condition on both facial motion and speech. Specialized Motion-DiT and Audio-DiT modules are coupled via selective joint attention layers that use temporally aligned positional embeddings and localized joint attention masking. Trained with an inpainting-style objective, the model supports conditioning on text, reference audio, and reference motion to perform synchronized talking-head generation from text, audio-driven animation, and additional tasks inside one model.
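As a rough illustration of what an inpainting-style flow-matching objective can look like, the sketch below uses the common linear-interpolation form: a noisy sample x_t = (1 - t) x0 + t x1 is built between noise x0 and data x1, the regression target is the velocity x1 - x0, and the loss is averaged only over positions that are masked out and must be generated. The function name, the mask convention, and the stand-in `predict_velocity` callable are assumptions for illustration; the paper's exact parameterization is not given in the text above.

```python
import numpy as np

def inpainting_fm_loss(x1, inpaint_mask, predict_velocity, rng):
    """Masked conditional flow-matching loss (illustrative sketch, not the paper's code).

    x1:               clean sequence, shape (T, D)
    inpaint_mask:     bool array, shape (T,); True = frame must be generated
    predict_velocity: callable(x_t, context, t) -> array of shape (T, D)
    rng:              numpy.random.Generator
    """
    x0 = rng.standard_normal(x1.shape)           # noise endpoint of the path
    t = rng.uniform()                            # scalar time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                # point on the linear path
    target = x1 - x0                             # velocity of the linear path

    keep = ~inpaint_mask
    context = np.where(keep[:, None], x1, 0.0)   # visible frames act as conditioning
    pred = predict_velocity(x_t, context, t)

    sq_err = ((pred - target) ** 2).mean(axis=-1)    # per-frame error, shape (T,)
    # Only masked (to-be-generated) frames contribute to the loss.
    n_masked = max(int(inpaint_mask.sum()), 1)
    return float((sq_err * inpaint_mask).sum() / n_masked)
```

Under this convention, changing which frames are masked changes the task without changing the loss, which is one plausible reading of how a single set of weights can serve several conditioning setups.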

What carries the argument

Multi-Modal Diffusion Transformer (MM-DiT) whose Motion-DiT and Audio-DiT modules are coupled through selective joint attention layers with localized masking and aligned positional embeddings.
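To make the idea of localized joint attention masking concrete, here is a small sketch in which each motion token may attend only to audio tokens whose timestamps fall within a fixed window around its own. The frame rates, the window size, and the function name are hypothetical; the paper's actual mask layout is not specified in the text above.

```python
import numpy as np

def local_cross_mask(n_motion, n_audio, motion_fps, audio_fps, window_s):
    """Boolean mask of shape (n_motion, n_audio); True = attention allowed.

    A motion token at time t_m may attend to an audio token at time t_a
    only if |t_m - t_a| <= window_s (all values here are assumed, not from the paper).
    """
    t_motion = np.arange(n_motion) / motion_fps   # timestamp of each motion token (s)
    t_audio = np.arange(n_audio) / audio_fps      # timestamp of each audio token (s)
    return np.abs(t_motion[:, None] - t_audio[None, :]) <= window_s

# Example: 25 fps motion, 100 Hz audio features, 25 ms window on each side.
mask = local_cross_mask(5, 20, motion_fps=25.0, audio_fps=100.0, window_s=0.025)
```

A mask like this keeps cross-modal attention near the diagonal in time, so each stream can still model long-range structure on its own while the coupling stays local.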

If this is right

Text prompts alone can drive synchronized talking-head video output.
Reference audio can animate a source face without separate lip-sync modules.
Reference motion can condition speech generation in the reverse direction.
An inpainting objective lets the same weights handle missing modalities at inference time.
Multiple audio-visual tasks run inside one coherent trained network rather than separate models.
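One way to picture how a single inpainting-trained model could cover these use cases is as a choice of which streams are given and which are left for generation. The sketch below is purely illustrative: the task names, the `Conditioning` structure, and the assumption about which tasks receive text are hypothetical and do not come from the paper's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Conditioning:
    text: bool           # transcript provided
    audio_given: bool    # target audio is supplied, so it is not generated
    motion_given: bool   # target motion is supplied, so it is not generated

    def streams_to_generate(self):
        """Return the streams the model would have to inpaint."""
        out = []
        if not self.audio_given:
            out.append("audio")
        if not self.motion_given:
            out.append("motion")
        return out

# Hypothetical task table; which inputs each task really uses is an assumption.
TASKS = {
    "text_to_talking_head": Conditioning(text=True, audio_given=False, motion_given=False),
    "audio_driven_animation": Conditioning(text=False, audio_given=True, motion_given=False),
    "motion_conditioned_speech": Conditioning(text=True, audio_given=False, motion_given=True),
}
```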

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint-attention pattern could be tested on full-body motion paired with audio or on longer video clips with background sound.
Training on mixed conditioning might reduce the need for task-specific fine-tuning in animation or virtual-agent pipelines.
If the localized masking proves robust, similar selective coupling could be applied to other paired modalities such as gesture and text.
The flow-matching backbone may allow faster sampling than diffusion baselines when generating both streams together.

Load-bearing premise

The selective joint attention layers and localized masking enable effective cross-modal interaction while still preserving each modality's independent strengths.

What would settle it

If a model trained on the same data but without the joint attention layers produces audio-motion pairs that are no less synchronized and no lower in quality on standard benchmarks, the claimed benefit of the coupled architecture would be refuted; a measurable drop in synchronization or quality would instead support it.

Figures

Figures reproduced from arXiv: 2506.23552 by Jaeseok Jung, Jaesik Park, Joonghyuk Shin, Mingi Kwon, Youngjung Uh.

**Figure 2.** LivePortrait framework and mouth-related expression keypoint analysis. LivePortrait’s …

**Figure 3.** The training and inference pipeline of the JAM-Flow framework. Our joint MM-DiT com…

Original abstract

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs (including text, reference audio, and reference motion), facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

JAM-Flow unifies audio and facial motion generation via flow matching and a custom MM-DiT with joint attention, but the architecture's claimed cross-modal gains still need ablations to stand out from simpler shared-objective baselines.

The main point is that this paper builds a single flow-matching model that can synthesize speech and facial motion together while accepting mixed conditioning like text, reference audio, or reference motion. The core proposal is the MM-DiT that splits into Motion-DiT and Audio-DiT modules linked by selective joint attention layers plus localized masking and aligned embeddings. That setup is meant to let the modalities interact without one dominating the other, and the inpainting-style loss supports the flexible task list they describe, from text-driven talking heads to audio-driven animation.

Referee Report

1 major / 1 minor

Summary. The paper introduces JAM-Flow, a unified framework for simultaneous synthesis and conditioning on facial motion and speech. It employs flow matching together with a novel Multi-Modal Diffusion Transformer (MM-DiT) that couples specialized Motion-DiT and Audio-DiT modules via selective joint attention layers, temporally aligned positional embeddings, and localized joint attention masking. The model is trained with an inpainting-style objective and supports conditioning on text, reference audio, and reference motion to enable tasks such as text-conditioned talking-head generation and audio-driven animation within a single coherent model.

Significance. If the claimed cross-modal benefits materialize, JAM-Flow would constitute a practical advance in multi-modal generative modeling by replacing separate talking-head and TTS pipelines with a single flow-matching model that handles a wide range of conditioning inputs. The architectural emphasis on preserving modality-specific strengths while enabling interaction is a potentially useful design pattern for other audio-visual tasks.

major comments (1)

[MM-DiT architecture description] The central claim that the selective joint attention layers and localized joint attention masking in the MM-DiT produce effective cross-modal interaction while preserving modality-specific strengths is load-bearing for the paper's contribution. No ablation results are presented that isolate these components against independent Motion-DiT/Audio-DiT modules or simpler fusion baselines; therefore it remains unclear whether any observed gains on joint tasks arise from the proposed attention mechanism or simply from the shared flow-matching objective and inpainting loss.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one quantitative result or baseline comparison to support the claim of a 'significant advance.'

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of isolating the contributions of the MM-DiT components. We address the major comment below and describe the planned revisions.

Referee: [MM-DiT architecture description] The central claim that the selective joint attention layers and localized joint attention masking in the MM-DiT produce effective cross-modal interaction while preserving modality-specific strengths is load-bearing for the paper's contribution. No ablation results are presented that isolate these components against independent Motion-DiT/Audio-DiT modules or simpler fusion baselines; therefore it remains unclear whether any observed gains on joint tasks arise from the proposed attention mechanism or simply from the shared flow-matching objective and inpainting loss.

Authors: We agree that the absence of targeted ablations leaves the specific contribution of the selective joint attention layers and localized joint attention masking insufficiently isolated. The current results demonstrate end-to-end performance but do not directly compare against independent Motion-DiT and Audio-DiT modules or simpler fusion baselines such as feature concatenation or standard cross-attention. In the revised manuscript we will add these ablation experiments on the same training and evaluation splits, reporting both quantitative metrics (e.g., synchronization error, perceptual quality) and qualitative visualizations that highlight the effect of the proposed masking and joint-attention design choices.

Revision: yes

Circularity Check

0 steps flagged

No circularity: model architecture and training described without self-referential derivations

The paper proposes JAM-Flow as a new unified framework combining flow matching with a custom MM-DiT architecture that couples Motion-DiT and Audio-DiT via selective joint attention, temporally aligned embeddings, and localized masking, trained under an inpainting-style objective. No equations, predictions, or first-principles results are shown that reduce by construction to fitted inputs, self-citations, or renamed known results. All components are presented as design choices justified by their intended cross-modal behavior on external data, leaving the central claims independent of any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the newly introduced MM-DiT coupling mechanism and the inpainting training objective; these are presented without independent prior validation or external benchmarks in the available text.

free parameters (1)

localized joint attention masking parameters
Key architectural choice for cross-modal interaction whose specific values are not detailed in the abstract.

axioms (1)

Domain assumption: temporally aligned positional embeddings maintain synchronization between audio and motion sequences.
Invoked to support effective cross-modal interaction in the MM-DiT.
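As an illustration of what temporally aligned positional embeddings could mean in practice, the sketch below assigns rotary position indices in a shared time base, so that an audio token and a motion token covering the same instant receive the same rotation angles even though the two streams run at different frame rates. The frame rates, the choice of the audio stream as the reference, and all helper names are assumptions made for this example.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotary angles, shape (len(positions), dim // 2); positions may be fractional."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (shape (T, dim)) by the given angles."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def aligned_positions(n_tokens, fps, ref_fps):
    """Express a stream's token indices in units of the reference stream's frames."""
    return np.arange(n_tokens) * (ref_fps / fps)

# Assumed rates: 100 Hz audio features (reference) and 25 fps motion.
audio_pos = aligned_positions(8, fps=100.0, ref_fps=100.0)    # 0, 1, ..., 7
motion_pos = aligned_positions(2, fps=25.0, ref_fps=100.0)    # 0, 4
```

With these positions, motion token 1 and audio token 4 both sit at 40 ms and share the same rotary phase, so relative-position terms in cross-stream attention reflect time offsets rather than raw index offsets.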

invented entities (1)

Multi-Modal Diffusion Transformer (MM-DiT) · no independent evidence
Purpose: to couple specialized Motion-DiT and Audio-DiT modules via selective joint attention.
New architecture component introduced to enable joint synthesis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and embed_strictMono · unclear
Paper passage: "we introduce N_joint layers of joint attention between the audio and motion streams... we apply scaled rotary positional embeddings (RoPE)... attention masking strategy that respects the temporal dynamics"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 7 internal anchors

[1]

Generative adversarial nets, 2014

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets, 2014

work page 2014
[2]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Conference on Neural Information Processing Systems (NeurIPS) , 2020

work page 2020
[3]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations (ICLR), 2021

work page 2021
[4]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, et al. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In International Conference on Learning Representations (ICLR) , 2023

work page 2023
[5]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow Matching for Generative Modeling. In International Conference on Learning Representations (ICLR) , 2023

work page 2023
[6]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[7]

Liveportrait: Efficient portrait animation with stitching and retargeting control

Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024

work page arXiv 2024
[8]

X-portrait: Expressive portrait animation with hierarchical motion attention

You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, and Linjie Luo. X-portrait: Expressive portrait animation with hierarchical motion attention. In ACM SIGGRAPH, 2024

work page 2024
[9]

First order motion model for image animation

Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Conference on Neural Information Processing Systems (NeurIPS) , 2019

work page 2019
[10]

Emoportraits: Emotion-enhanced multimodal one-shot head avatars

Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos V ougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2024

work page 2024
[11]

One-shot free-view neural talking-head synthesis for video conferencing

Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2021

work page 2021
[12]

K. R. Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C. V . Jawahar. Wav2Lip: A lip sync expert is all you need for speech to lip generation in the wild. InProceedings of the 28th ACM International Conference on Multimedia (ACM MM) , 2020

work page 2020
[13]

MakeItTalk: Speaker-aware talking-head animation

Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. MakeItTalk: Speaker-aware talking-head animation. In ACM SIGGRAPH Asia, 2020

work page 2020
[14]

Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions

Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In European Conference on Computer Vision (ECCV). Springer, 2024

work page 2024
[15]

Vasa-1: Lifelike audio-driven talking faces generated in real time

Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifelike audio-driven talking faces generated in real time. Conference on Neural Information Processing Systems (NeurIPS) , 2024. 10

work page 2024
[16]

Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models.arXiv preprint arXiv:2502.01061, 2025

Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and Chao Liang. OmniHuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. arXiv preprint arXiv:2502.01061, 2025

work page arXiv 2025
[17]

SadTalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation

Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. SadTalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2023

work page 2023
[18]

F5-TTS: High-fidelity text-to-speech via conditional flow matching and inpainting

Junyang Chen, Chenpeng Du, Zhenhui Ye, and Yanwei Fu. F5-TTS: High-fidelity text-to-speech via conditional flow matching and inpainting. arXiv preprint arXiv:2411.00000, 2024

work page arXiv 2024
[19]

Megatts 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis,

Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, et al. Megatts 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis. arXiv preprint arXiv:2502.18924, 2025

work page arXiv 2025
[20]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML) , 2024

work page 2024
[21]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[22]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

work page 2024
[23]

Black Forest Labs. Flux.1. https://blackforestlabs.ai/announcing-black-forest-labs/ ,

work page
[24]

Accessed: November 2024

work page 2024
[25]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Stylerig: Rigging stylegan for 3d control over portrait images

Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhofer, and Christian Theobalt. Stylerig: Rigging stylegan for 3d control over portrait images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 6142–6151, 2020

work page 2020
[27]

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Fastspeech: Fast, robust and controllable text to speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. Conference on Neural Information Processing Systems (NeurIPS) , 2019

work page 2019
[29]

Fastspeech 2: Fast and high-quality end-to-end text to speech

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020

work page arXiv 2006
[30]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Naturalspeech: End-to-end text-to-speech synthesis with human-level quality

Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, et al. Naturalspeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4234–4245, 2024

work page 2024
[32]

Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024

work page arXiv 2024
[33]

Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023

work page arXiv 2023
[34]

V oicebox: Text-guided multilingual universal speech generation at scale

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sarı, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. V oicebox: Text-guided multilingual universal speech generation at scale. In Conference on Neural Information Processing Systems (NeurIPS) , 2023

work page 2023
[35]

Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer

Keon Lee, Dong Won Kim, Jaehyeon Kim, and Jaewoong Cho. Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer. arXiv preprint arXiv:2406.11427, 2024. 11

work page arXiv 2024
[36]

E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, et al. E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 682–689. IEEE, 2024

work page 2024
[37]

Learning to dub movies via hierarchical prosody models

Gaoxiang Cong, Liang Li, Yuankai Qi, Zheng-Jun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, and Qingming Huang. Learning to dub movies via hierarchical prosody models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2023

work page 2023
[38]

Styledubber: towards multi-scale style learning for movie dubbing

Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton van den Hengel, Ming- Hsuan Yang, Chenggang Yan, and Qingming Huang. Styledubber: towards multi-scale style learning for movie dubbing. arXiv preprint arXiv:2402.12636, 2024

work page arXiv 2024
[39]

V oiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, and David Harwath. V oiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models. arXiv preprint arXiv:2504.02386, 2025

work page arXiv 2025
[40]

Flow-guided one-shot talking face genera- tion with a high-resolution audio-visual dataset

Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face genera- tion with a high-resolution audio-visual dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021
[41]

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models,

Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, and Zhidong Deng. Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv preprint arXiv:2312.09767, 2(3), 2023

work page arXiv 2023
[42]

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation,

Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694, 2024

work page arXiv 2024
[43]

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation,

Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, and Siyu Zhu. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv preprint arXiv:2406.08801, 2024

work page arXiv 2024
[44]

Hallo3: Highly dynamic and realistic portrait image animation with diffusion transformer networks

Jiahui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. Hallo3: Highly dynamic and realistic portrait image animation with diffusion transformer networks. arXiv preprint arXiv:2412.00733, 2024

work page arXiv 2024
[45]

Librispeech: an asr corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015

work page 2015
[46]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications

Hao-Han Guo, Yao Hu, Kun Liu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, Kun Xie, and Kai-Tuo Xu. Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications. arXiv preprint arXiv:2409.03283, 2024

work page arXiv 2024
[48]

V oicecraft: Zero-shot speech editing and text-to-speech in the wild

Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, and David Harwath. V oicecraft: Zero-shot speech editing and text-to-speech in the wild. arXiv preprint arXiv:2403.16973, 2024

work page arXiv 2024
[49]

Celebv-hq: A large-scale video facial attributes dataset

Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. Celebv-hq: A large-scale video facial attributes dataset. In European Conference on Computer Vision (ECCV), 2022

work page 2022
[50]

Celebv-text: A large-scale facial text-video dataset

Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu. Celebv-text: A large-scale facial text-video dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[51]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Conference on Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[52]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[53]

Identity-preserving talking face generation with landmark and appearance priors

Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, and Guanbin Li. Identity-preserving talking face generation with landmark and appearance priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 12

work page 2023
[54]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Learning Representations (ICLR), 2023

work page 2023
[55]

Spleeter: a fast and efficient music source separation tool with pre-trained models

Romain Hennequin, Anis Khlif, Felix V oituret, and Manuel Moussallam. Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020

work page 2020
[56]

Audio-visual speech representation expert for enhanced talking face video generation and evaluation

Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Akti, Hazım Kemal Ekenel, and Alexander Waibel. Audio-visual speech representation expert for enhanced talking face video generation and evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2024

work page 2024
[57]

Sidgan: High-resolution dubbed video generation via shift-invariant learning

Urwa Muaz, Wondong Jang, Rohun Tripathi, Santhosh Mani, Wenbin Ouyang, Ravi Teja Gadde, Baris Gecer, Sergio Elizondo, Reza Madad, and Naveen Nair. Sidgan: High-resolution dubbed video generation via shift-invariant learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

[58]

Out of time: automated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 251–263. Springer, 2017

[59]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

[60]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

A Qualitative Comparisons, User Study, and Discussions

A.1 Qualitative Analysis

We provide extensive qualitative comparisons across fi...

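1. Project input to QKV:
   $(q_1, k_1, v_1) \leftarrow \mathrm{attn}_1.\mathrm{ToQKV}(x_1)$
   $(q_2, k_2, v_2) \leftarrow \mathrm{attn}_2.\mathrm{ToQKV}(x_2)$

2. Apply rotary embeddings (if provided):
   if $\mathrm{rope}_1$ exists then $(q_1, k_1) \leftarrow \mathrm{ApplyRoPE}(q_1, k_1, \mathrm{rope}_1)$
   if $\mathrm{rope}_2$ exists then $(q_2, k_2) \leftarrow \mathrm{ApplyRoPE}(q_2, k_2, \mathrm{rope}_2)$

3. Construct joint token pools:
   if $\alpha_1 = 1$ then $q_1^\star \leftarrow [q_1; q_2]$, $k_1^\star \leftarrow [k_1; k_2]$, $v_1^\star \leftarrow [v_1; v_2]$ else $(q_1^\star, k_1^\star, v_1^\star) \leftarrow (q_1, k_1, v_1)$
   if $\alpha_2 = 1$ then $q_2^\star \leftarrow [q_2; q_1]$, $k_2^\star \leftarrow [k_2; k_1]$, $v_2^\star \leftarrow [v_2; v_1]$ else $(q_2^\star, k_2^\star, v_2^\star) \leftarrow (q_2, k_2, v_2)$

4. Split heads and apply masks:
   $(q_1^\star, k_1^\star, v_1^\star) \leftarrow \mathrm{SplitHeads}(q_1^\star, k_1^\star, v_1^\star)$
   $(q_2^\star, k_2^\star, v_2^\star) \leftarrow \mathrm{SplitHeads}(q_2^\star, k_2^\star, v_2^\star)$
   if $\mathrm{mask}_1 \neq \emptyset \wedge \alpha_1 = 1$ then $M_1 \leftarrow \mathrm{CustomDiagMask}(L_1, L_2, \mathrm{mask}_1)$ else $M_1 \leftarrow \emptyset$
   if $\mathrm{mask}_2 \neq \emptyset \wedge \alpha_2 = 1$ then $M_2 \leftarrow \mathrm{CustomDiagMask}(L_2, L_1, \mathrm{mask}_2)$ else $M_2 \leftarrow \emptyset$

5. Compute scaled dot-product attention:
   $o_1^\star \leftarrow \mathrm{SDPA}(q_1^\star, k_1^\star, v_1^\star, M_1)$
   $o_2^\star \leftarrow \mathrm{SDPA}(q_2^\star, k_2^\star, v_2^\star, M_2)$

6. Merge heads, trim to original length, and project:
   $o_1 \leftarrow \mathrm{MergeHeads}(o_1^\star)[:, :L_1]$, $o_1 \leftarrow \mathrm{attn}_1.\mathrm{OutProj}(o_1)$
   $o_2 \leftarrow \mathrm{MergeHeads}(o_2^\star)[:, :L_2]$, $o_2 \leftarrow \mathrm{attn}_2.\mathrm{OutProj}(o_2)$
   return $(o_1, o_2)$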
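The joint attention steps listed above (QKV projection, joint token pools, masking, scaled dot-product attention, trimming, and output projection) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Attn` class, the `local_diag_mask` helper, and its `window` parameter are hypothetical names introduced here, both streams are assumed to share one feature dimension, and rotary embeddings are omitted for brevity. The listing does not spell out what the diagonal mask computes, so the window rule below (each token sees all of its own stream plus other-stream tokens near its temporally aligned position) is only one plausible reading of "localized joint attention masking".

```python
import numpy as np


class Attn:
    """Hypothetical per-stream projection weights (stand-in for attn1 / attn2)."""

    def __init__(self, dim, rng):
        self.w_qkv = rng.standard_normal((dim, 3 * dim)) / np.sqrt(dim)
        self.w_out = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def to_qkv(self, x):
        return np.split(x @ self.w_qkv, 3, axis=-1)

    def out_proj(self, o):
        return o @ self.w_out


def split_heads(x, n_heads):
    # (L, D) -> (H, L, D / H)
    length, dim = x.shape
    return x.reshape(length, n_heads, dim // n_heads).transpose(1, 0, 2)


def merge_heads(x):
    # (H, L, Dh) -> (L, H * Dh)
    n_heads, length, head_dim = x.shape
    return x.transpose(1, 0, 2).reshape(length, n_heads * head_dim)


def sdpa(q, k, v, mask=None):
    # Scaled dot-product attention; mask is boolean, True = may attend.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def local_diag_mask(l_self, l_other, window):
    # ASSUMPTION: own-stream tokens are always visible; other-stream tokens
    # are visible only within `window` steps of the aligned time position.
    total = l_self + l_other
    mask = np.ones((total, total), dtype=bool)
    other_pos = np.arange(l_other) * (l_self / l_other)  # map onto own time axis
    self_pos = np.arange(l_self)
    mask[:l_self, l_self:] = np.abs(self_pos[:, None] - other_pos[None, :]) <= window
    return mask


def joint_attention(x1, x2, attn1, attn2, n_heads,
                    alpha1=True, alpha2=True, window1=None, window2=None):
    qkv1, qkv2 = attn1.to_qkv(x1), attn2.to_qkv(x2)

    def branch(own, other, attn, alpha, window):
        l_own, l_other = own[0].shape[0], other[0].shape[0]
        q, k, v = own
        if alpha:  # joint pool: own tokens first, then the other stream
            q, k, v = (np.concatenate([a, b], axis=0) for a, b in zip(own, other))
        use_mask = alpha and window is not None
        mask = local_diag_mask(l_own, l_other, window) if use_mask else None
        out = sdpa(*(split_heads(t, n_heads) for t in (q, k, v)), mask)
        # Trim back to the stream's own length before the output projection.
        return attn.out_proj(merge_heads(out)[:l_own])

    o1 = branch(qkv1, qkv2, attn1, alpha1, window1)
    o2 = branch(qkv2, qkv1, attn2, alpha2, window2)
    return o1, o2


rng = np.random.default_rng(0)
motion = rng.standard_normal((8, 16))   # hypothetical motion tokens
audio = rng.standard_normal((20, 16))   # hypothetical audio tokens
o_motion, o_audio = joint_attention(
    motion, audio, Attn(16, rng), Attn(16, rng), n_heads=4, window1=2, window2=2
)
print(o_motion.shape, o_audio.shape)
```

Setting `alpha1` or `alpha2` to `False` reduces the corresponding stream to ordinary self-attention, which mirrors the selective coupling controlled by $\alpha_1$ and $\alpha_2$ in the listing. Because each stream's own tokens are placed first in its pool, trimming to the first $L_1$ (or $L_2$) rows recovers exactly that stream's outputs.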
