MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Akio Hayakawa; Alexander Schwing; Ho Kei Cheng; Masato Ishii; Takashi Shibuya; Yuki Mitsufuji

arxiv: 2412.15322 · v2 · pith:LJREVEYHnew · submitted 2024-12-19 · 💻 cs.CV · cs.LG · cs.SD · eess.AS


Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

classification 💻 cs.CV cs.LG cs.SD eess.AS

keywords mmaudio audio training high-quality joint video achieves audio-visual



We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
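The abstract names a flow matching objective without spelling it out. As a rough illustration only, the sketch below shows a generic conditional flow matching loss and an Euler sampler in plain Python. It assumes the common straight-path formulation (noise at t=0, data at t=1); the function names, the `cond` argument, and the step count are hypothetical and are not taken from the MMAudio code.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x1, cond, rng):
    """One-sample estimate of the loss for a data latent x1.

    velocity_fn(x_t, t, cond) stands in for the network (hypothetical
    signature); cond stands in for video/text conditioning features.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t, cond)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)


def euler_sample(velocity_fn, x0, cond, steps=25):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this formulation, training minimizes the squared error between the predicted and target velocities, and generation starts from noise and follows the learned velocity field to a data latent. A small fixed number of integration steps is one plausible reason such models can have low inference time, though the abstract does not state the sampler used.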



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
cs.SD 2026-05 unverdicted novelty 7.0

TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
cs.CV 2026-04 unverdicted novelty 7.0

FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.

Wan: Open and Advanced Large-Scale Video Generative Models
cs.CV 2025-03 unverdicted novelty 5.0

Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.