CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextual stages plus joint regularization.
arXiv preprint arXiv:2504.02386 (2025)
2 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 2
representative citing papers
JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
citing papers explorer
- CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
- JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching