MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Pith reviewed 2026-05-18 08:16 UTC · model grok-4.3
The pith
A fine-tuned video-to-audio generative model achieves superior sound separation from video or text queries while retaining its original generation abilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation.
What carries the argument
MMAudioSep, the fine-tuned version of a pretrained video-to-audio generative model that enables separation of sounds based on video or text queries.
Load-bearing premise
The relationships between video, text, and audio learned during pretraining can be effectively transferred and adapted to the task of separating individual sounds from mixed audio.
What would settle it
Running the fine-tuned model on pure video-to-audio generation benchmarks and finding that the audio quality or relevance drops substantially compared to the original pretrained model would disprove the claim that original capabilities are retained.
Original abstract
We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MMAudioSep, a generative model for video/text-queried sound separation founded on a pretrained video-to-audio model. It claims efficient training via transfer of learned video/text-audio relationships, superior performance to both deterministic and generative separation baselines, and retention of the original video-to-audio generation capability after fine-tuning for the separation task. The work positions foundational generative models as adaptable bases for downstream sound-related tasks.
Significance. If the central claims hold with quantitative support, the result would demonstrate effective transfer from pretrained generative priors to separation without capability collapse, supporting lower-cost adaptation of foundational audio models and the viability of multi-task generative systems. This aligns with broader trends in leveraging large pretrained models for audio downstream tasks.
major comments (2)
- [Abstract and §4 (Experiments/Evaluation)] The claim of superiority over existing separation models (deterministic and generative) is stated without any reported metrics, dataset details, error analysis, or statistical comparisons in the abstract or evaluation summary. This prevents verification of the performance advantage, which is load-bearing for the main empirical claim.
- [Results on retention (§4 or §5)] The assertion that the model 'retains the ability for original video-to-audio generation' after separation fine-tuning is presented without quantitative pre/post-fine-tuning comparisons (e.g., FAD, CLAP similarity, or equivalent scores on a held-out V2A set). This is load-bearing for the transfer-efficiency story, as unquantified capability loss would undermine the foundational-model adaptation argument.
minor comments (1)
- [Abstract] The GitHub link is provided but no details on reproducibility (e.g., exact fine-tuning hyperparameters or evaluation protocols) are summarized in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on strengthening the empirical presentation. We address each major comment below and will revise the manuscript accordingly to include the requested quantitative details and clarifications.
- Referee: [Abstract and §4 (Experiments/Evaluation)] The claim of superiority over existing separation models (deterministic and generative) is stated without any reported metrics, dataset details, error analysis, or statistical comparisons in the abstract or evaluation summary. This prevents verification of the performance advantage, which is load-bearing for the main empirical claim.
  Authors: We agree that the abstract and evaluation summary would benefit from more explicit quantitative support. The detailed metrics, tables, and comparisons appear in the body of §4, but the high-level summary and abstract currently emphasize qualitative superiority without numbers. In the revised version we will add key performance figures to the abstract (within length constraints) and expand the §4 summary paragraph to state the primary datasets, report main metric values with improvements over baselines, and include a brief note on error analysis and statistical significance testing where performed.
  Revision: yes
- Referee: [Results on retention (§4 or §5)] The assertion that the model 'retains the ability for original video-to-audio generation' after separation fine-tuning is presented without quantitative pre/post-fine-tuning comparisons (e.g., FAD, CLAP similarity, or equivalent scores on a held-out V2A set). This is load-bearing for the transfer-efficiency story, as unquantified capability loss would undermine the foundational-model adaptation argument.
  Authors: We acknowledge the value of explicit quantification. The manuscript currently supports retention via continued generation examples and the absence of obvious degradation, yet does not provide direct pre/post numerical comparisons on a held-out V2A benchmark. In the revision we will compute and report such metrics (FAD, CLAP similarity, and any other relevant scores) on a held-out video-to-audio set before and after separation fine-tuning, thereby quantifying retention and reinforcing the transfer-efficiency claim.
  Revision: yes
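To make the requested pre/post comparison concrete, here is a minimal sketch of the Fréchet distance that underlies FAD. It is a simplification and not the paper's evaluation code: it assumes diagonal covariances so that it runs on the standard library alone, whereas FAD as normally computed uses full covariance matrices and a matrix square root. The function names and the toy embeddings are hypothetical.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(embeddings):
    """Per-dimension mean and variance of equal-length embedding vectors."""
    dims = list(zip(*embeddings))  # transpose: one tuple per dimension
    mu = [fmean(d) for d in dims]
    var = [pvariance(d) for d in dims]
    return mu, var


def frechet_distance_diag(stats_a, stats_b):
    """Frechet distance between two Gaussians with diagonal covariance.

    Simplifying assumption: off-diagonal covariance terms are ignored, so
    tr(S_a + S_b - 2 (S_a S_b)^(1/2)) reduces to a per-dimension sum of
    (sqrt(var_a) - sqrt(var_b))^2.
    """
    (mu_a, var_a), (mu_b, var_b) = stats_a, stats_b
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
reference = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 1.0], [3.0, 3.0]]

print(frechet_distance_diag(gaussian_stats(reference), gaussian_stats(shifted)))
```

In a retention check along the lines the referee asks for, one would compute this distance twice against the same reference set, once for audio generated by the pretrained model and once for audio generated after separation fine-tuning, and compare the two values.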
Circularity Check
No circularity: empirical fine-tuning results independent of inputs by construction
The paper introduces MMAudioSep via fine-tuning of a pretrained video-to-audio model for sound separation, claiming efficiency gains from transferred knowledge, superior performance versus baselines, and retention of generative capability. No equations, derivations, or first-principles results appear in the abstract or described content that could reduce to fitted parameters or self-definitions by construction. Evaluations rely on external baseline comparisons rather than internal predictions forced by the training setup itself. The transfer and retention claims are presented as empirical outcomes, with no load-bearing self-citations or ansatzes that collapse the argument to prior inputs. This is a standard empirical ML paper whose central claims remain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We propose a fine-tuning approach that adapts pretrained video-to-audio models for video/text-queried sound separation... MMAudio uses the conditional flow matching (CFM) objective... channel-concatenation conditioning mechanism."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "MMAudioSep (pretrain w/frozen) ... retains the ability for original video-to-audio generation."
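For readers unfamiliar with the conditional flow matching (CFM) objective named in the first passage above, the block below writes it out in its common linear-path form. The first two lines are the standard formulation; the paper's exact notation may differ. The third line is an assumption about how channel-concatenation conditioning could enter for separation: the symbol m for the mixture latent and the bracket notation [x_t ; m] are hypothetical and are not taken from the paper.

```latex
% Linear path between a noise sample x_0 and a data latent x_1
x_t = t\,x_1 + (1 - t)\,x_0, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

% CFM objective: regress the predicted velocity onto the path velocity x_1 - x_0,
% given conditions C (video and/or text features)
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,(x_1, C)} \left\| v_\theta(x_t, t, C) - (x_1 - x_0) \right\|^2

% Assumed separation variant: the mixture latent m is concatenated with x_t
% along the channel axis, and x_1 is the latent of the target source
\mathcal{L}_{\mathrm{sep}}(\theta) = \mathbb{E}_{t,\,x_0,\,(x_1, m, C)} \left\| v_\theta([x_t ; m], t, C) - (x_1 - x_0) \right\|^2
```

Under this reading, the regression target is unchanged between generation and separation; only the network input gains extra channels.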
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Sound separation (SS) models that use conditional information to control model behavior have been widely discussed, including visual-queried separation [1, 2], text-queried separation [3, 4, 5], and omni-modality separation [6]. Recently, sound separation based on a generative approach [5] has been presented, whereas most sound separation mod...
- [2] MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation (arXiv, 2025). RELATED WORK. 2.1. Sound Separation. Sound separation reconstructs individual sound sources from mixed signals, crucial for multi-speaker environments, music production, and assistive listening [8]. Recent advances include source extraction using auxiliary information and hybrid methods that combine signal processing with neural networks. Text-Queried So...
- [3] MMAUDIOSEP. 3.1. MMAudio. MMAudio [7] is a generative model that uses flow matching to synthesize audio of an input video with optional text conditions. The model operates in a latent space where audio waveforms are encoded via a pretrained variational autoencoder (VAE). In this section, we describe the overview of MMAudio, which serves as the foundation...
- [4] EXPERIMENTS. 4.1. Experimental Setup. Training Dataset. To train MMAudioSep, we utilized the same dataset employed for the pretrained MMAudio, which totals approximately 2,500 hours. This includes 400 hours for Video+Audio+Label and 2,100 hours for Audio+Text. We train on VGGSound [24] as the video-audio-text dataset. We use the first 8s of each video for ...
- [5] ... as the audio-text datasets. For short audios (<16s), we truncate them to 8s for training, as in VGGSound. For longer audios, we take up to five non-overlapping crops of 8s each. This results in a total of 951K audio clip-text pairs. Evaluation Dataset. The AudioSep evaluation dataset is publicly accessible. Since our method also utilizes video queries, our...
- [6] ... and 1,000 samples from the MUSIC dataset [9], which were both taken from the AudioSep website. Each file contains mixture and target sound. In the VGGSound-Clean dataset, target and noise signals were sampled between -35 dB and -25 dB LUFS (Loudness Units Full Scale) and mixed. The average signal-to-noise ratio (SNR) of the VGGSound-Clean dataset is aroun...
- [7] CONCLUSION. This paper introduces MMAudioSep, an innovative approach to video/text-queried sound separation that leverages a pretrained video-to-audio generation model. The MMAudioSep model advances in multimodal sound separation by utilizing a fine-tuned pretrained MMAudio model with channel-concatenation conditioning. Its primary contributions includ...
- [8] Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, and Taylor Berg-Kirkpatrick, “Clipsep: Learning text-queried sound separation with noisy unlabeled videos,” in ICLR, 2023
- [9] Jiaben Chen, Renrui Zhang, Dongze Lian, Jiaqi Yang, Ziyao Zeng, and Jianbo Shi, “iquery: Instruments as queries for audio-visual sound separation,” in CVPR, 2023
- [10] Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D Plumbley, and Wenwu Wang, “Separate what you describe: Language-queried audio source separation,” in Interspeech, 2022
- [11] Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D Plumbley, and Wenwu Wang, “Separate anything you describe,” arXiv preprint arXiv:2308.05037, 2023
- [12] Yi Yuan, Xubo Liu, Haohe Liu, Mark D Plumbley, and Wenwu Wang, “Flowsep: Language-queried sound separation with rectified flow matching,” in ICASSP, 2025
- [13] Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Shengpeng Ji, Jialong Zuo, Tao Jin, and Zhou Zhao, “Omnisep: Unified omni-modality sound separation with query-mixup,” in ICLR, 2025
- [14] Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji, “MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis,” in CVPR, 2025
- [15] Shoko Araki, Nobutaka Ito, Reinhold Haeb-Umbach, Gordon Wichern, Zhong-Qiu Wang, and Yuki Mitsufuji, “30+ years of source separation research: Achievements and future challenges,” in ICASSP, 2025
- [16] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba, “The sound of pixels,” in ECCV, 2018
- [17] Ruohan Gao and Kristen Grauman, “Co-separating sounds of visual objects,” in ICCV, 2019
- [18] Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba, “The sound of motions,” in ICCV, 2019
- [19] Lingyu Zhu and Esa Rahtu, “Visually guided sound source separation and localization using self-supervised motion representations,” arXiv preprint arXiv:2104.08506, 2021
- [20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, 2021
- [21] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach, “Scaling rectified flow transformers for high-resolution image synthesis,” in ICML, 2024
- [22] Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” in ICLR, 2023
- [23] Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee, “Read, watch and scream! sound generation from text and video,” arXiv preprint arXiv:2407.05551, 2024
- [24] Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, and Qifeng Chen, “Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners,” in CVPR, 2024
- [25] Ilpo Viertola, Vladimir Iashin, and Esa Rahtu, “Temporally aligned audio for video with autoregression,” in ICASSP, 2025
- [26] Xiulong Liu, Kun Su, and Eli Shlizerman, “Tell what you hear from what you see–video to audio generation through text,” arXiv preprint arXiv:2411.05679, 2024
- [27] Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, and Zhou Zhao, “Frieren: Efficient video-to-audio generation network with rectified flow matching,” in NeurIPS, 2024
- [28] Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen, “Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds,” arXiv preprint arXiv:2407.01494, 2024
- [29] Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai, “V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models,” in AAAI, 2024
- [30] Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, and Yuki Mitsufuji, “Specmaskfoley: Steering pretrained spectral masked generative transformer toward synchronized video-to-audio synthesis via controlnet,” arXiv preprint arXiv:2505.16195, 2025
- [31] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman, “Vggsound: A large-scale audio-visual dataset,” in ICASSP, 2020
- [32] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL-HLT, 2019
- [33] Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen, “Clotho: An audio captioning dataset,” in ICASSP, 2020
- [34] Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, and Wenwu Wang, “WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” IEEE/ACM TASLP, 2024
- [35] Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” arXiv preprint arXiv:1812.08466, 2018
- [36] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP, 2023
- [37] Khaled Koutini, Jan Schlüter, Hamid Eghbal-Zadeh, and Gerhard Widmer, “Efficient training of audio transformers with patchout,” arXiv preprint arXiv:2110.05069, 2021
- [38] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “Cnn architectures for large-scale audio classification,” in ICASSP, 2017
- [39] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020
- [40] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training gans,” in NeurIPS, 2016
- [41] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra, “Imagebind: One embedding space to bind them all,” in CVPR, 2023
- [42] Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman, “Synchformer: Efficient synchronization from sparse cues,” in ICASSP, 2024
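The Experimental Setup excerpt quoted in the list above states a cropping rule for the audio-text data: clips shorter than 16 s are truncated to 8 s, and longer clips yield up to five non-overlapping 8 s crops. The sketch below is one possible reading of that rule, not the authors' data loader. The function name is hypothetical, and two details are assumptions the excerpt does not specify: crops are taken back to back from the start of the clip, and clips shorter than 8 s are returned at their full length.

```python
CROP_SECONDS = 8.0      # crop length stated in the excerpt
SHORT_THRESHOLD = 16.0  # clips below this length count as "short"
MAX_CROPS = 5           # upper bound on crops per long clip


def crop_windows(duration_s):
    """Return (start, end) windows in seconds for one audio clip."""
    if duration_s < SHORT_THRESHOLD:
        # Short clip: keep the first 8 s only.
        # Assumption: a clip under 8 s is kept whole (padding not modeled).
        return [(0.0, min(CROP_SECONDS, duration_s))]
    # Long clip: as many full 8 s crops as fit, capped at five.
    # Assumption: crops are consecutive, starting at 0 s.
    n_crops = min(MAX_CROPS, int(duration_s // CROP_SECONDS))
    return [(i * CROP_SECONDS, (i + 1) * CROP_SECONDS) for i in range(n_crops)]


print(crop_windows(10.0))        # one truncated window
print(crop_windows(20.0))        # two full windows
print(len(crop_windows(100.0)))  # capped at MAX_CROPS
```

Taking crops from the start keeps the sketch deterministic; a real pipeline might instead sample crop positions at random, which the excerpt neither confirms nor rules out.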