
arxiv: 2510.09065 · v2 · submitted 2025-10-10 · 💻 cs.SD · cs.CV · cs.LG · eess.AS

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

Pith reviewed 2026-05-18 08:16 UTC · model grok-4.3

keywords model · generative · mmaudiosep · models · separation · sound · video · video-to-audio

The pith

A fine-tuned video-to-audio generative model achieves superior sound separation from video or text queries while retaining its original generation abilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MMAudioSep, which adapts a pretrained video-to-audio model to separate sounds in videos using video or text queries. Instead of training a separation model from scratch, it reuses the knowledge already captured in the generative model to train more efficiently. The paper reports that the model outperforms previous separation approaches, both deterministic and generative. After fine-tuning, the same model can still generate audio directly from video and text inputs. This suggests that large pretrained sound generation models can serve as a base for multiple audio processing tasks.

Core claim

We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation.

What carries the argument

MMAudioSep, the fine-tuned version of a pretrained video-to-audio generative model that enables separation of sounds based on video or text queries.
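
As a rough illustration of what adapting a flow-matching generator for separation can look like, the sketch below builds one training example: a noisy interpolation of the target latent, concatenated along the channel axis with the mixture latent, plus the velocity a network would regress. The extracted paper text further down this page mentions flow matching, a VAE latent space, and channel-concatenation conditioning; the straight-line interpolation, the `(channels, frames)` array layout, and the function name `flow_matching_pair` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def flow_matching_pair(target_latent, mixture_latent, t, rng):
    """Build one (model_input, velocity_target) pair for a single timestep t in [0, 1].

    Both latents are assumed to have shape (channels, frames).
    """
    # Gaussian noise serves as the starting point of the flow (assumed convention).
    noise = rng.standard_normal(target_latent.shape)
    # Straight-line interpolation between noise (t=0) and the target latent (t=1).
    x_t = (1.0 - t) * noise + t * target_latent
    # Along a straight path the velocity is constant: target minus noise.
    velocity = target_latent - noise
    # Channel-concatenation conditioning: stack the mixture latent under x_t.
    model_input = np.concatenate([x_t, mixture_latent], axis=0)
    return model_input, velocity


rng = np.random.default_rng(0)
target = rng.standard_normal((4, 10))   # hypothetical latent of the target sound
mixture = rng.standard_normal((4, 10))  # hypothetical latent of the mixed audio
model_input, velocity = flow_matching_pair(target, mixture, 0.3, rng)
print(model_input.shape, velocity.shape)
```

In this layout the first half of the channels carries the noisy target and the second half carries the mixture, so a pretrained network would only need a wider input layer to accept the extra condition. Whether MMAudioSep widens its input exactly this way is not stated on this page.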

Load-bearing premise

The relationships between video, text, and audio learned during pretraining can be effectively transferred and adapted to the task of separating individual sounds from mixed audio.

What would settle it

Running the fine-tuned model on pure video-to-audio generation benchmarks and finding that the audio quality or relevance drops substantially compared to the original pretrained model would disprove the claim that original capabilities are retained.

Original abstract

We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MMAudioSep, a generative model for video/text-queried sound separation founded on a pretrained video-to-audio model. It claims efficient training via transfer of learned video/text-audio relationships, superior performance to both deterministic and generative separation baselines, and retention of the original video-to-audio generation capability after fine-tuning for the separation task. The work positions foundational generative models as adaptable bases for downstream sound-related tasks.

Significance. If the central claims hold with quantitative support, the result would demonstrate effective transfer from pretrained generative priors to separation without capability collapse, supporting lower-cost adaptation of foundational audio models and the viability of multi-task generative systems. This aligns with broader trends in leveraging large pretrained models for audio downstream tasks.

major comments (2)
  1. [Abstract and §4 (Experiments/Evaluation)] The claim of superiority over existing separation models (deterministic and generative) is stated without any reported metrics, dataset details, error analysis, or statistical comparisons in the abstract or evaluation summary; this prevents verification of the performance advantage and is load-bearing for the main empirical claim.
  2. [Results on retention (§4 or §5)] The assertion that the model 'retains the ability for original video-to-audio generation' after separation fine-tuning is presented without quantitative pre/post-fine-tuning comparisons (e.g., FAD, CLAP similarity, or equivalent scores on a held-out V2A set). This is load-bearing for the transfer-efficiency story, as unquantified capability loss would undermine the foundational-model adaptation argument.
minor comments (1)
  1. [Abstract] The GitHub link is provided but no details on reproducibility (e.g., exact fine-tuning hyperparameters or evaluation protocols) are summarized in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the empirical presentation. We address each major comment below and will revise the manuscript accordingly to include the requested quantitative details and clarifications.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments/Evaluation)] The claim of superiority over existing separation models (deterministic and generative) is stated without any reported metrics, dataset details, error analysis, or statistical comparisons in the abstract or evaluation summary; this prevents verification of the performance advantage and is load-bearing for the main empirical claim.

    Authors: We agree that the abstract and evaluation summary would benefit from more explicit quantitative support. The detailed metrics, tables, and comparisons appear in the body of §4, but the high-level summary and abstract currently emphasize qualitative superiority without numbers. In the revised version we will add key performance figures to the abstract (within length constraints) and expand the §4 summary paragraph to state the primary datasets, report main metric values with improvements over baselines, and include a brief note on error analysis and statistical significance testing where performed.

    revision: yes

  2. Referee: [Results on retention (§4 or §5)] The assertion that the model 'retains the ability for original video-to-audio generation' after separation fine-tuning is presented without quantitative pre/post-fine-tuning comparisons (e.g., FAD, CLAP similarity, or equivalent scores on a held-out V2A set). This is load-bearing for the transfer-efficiency story, as unquantified capability loss would undermine the foundational-model adaptation argument.

    Authors: We acknowledge the value of explicit quantification. The manuscript currently supports retention via continued generation examples and the absence of obvious degradation, yet does not provide direct pre/post numerical comparisons on a held-out V2A benchmark. In the revision we will compute and report such metrics (FAD, CLAP similarity, and any other relevant scores) on a held-out video-to-audio set before and after separation fine-tuning, thereby quantifying retention and reinforcing the transfer-efficiency claim.

    revision: yes
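
The pre/post comparison discussed above could be scored with a Fréchet distance between two sets of audio embeddings, which is the quantity underlying FAD. The sketch below is a minimal NumPy version under stated assumptions: embeddings are taken to have been extracted already by some audio encoder (not shown), and `frechet_distance` and the random stand-in embeddings are hypothetical names and data, not taken from the paper or its repository.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (num_clips, embedding_dim).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite covariances (clipped here against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
reference = rng.standard_normal((200, 4))      # stand-in for reference embeddings
before = reference + 0.1                       # stand-in: model before fine-tuning
after = reference + 0.5                        # stand-in: model after fine-tuning
print(frechet_distance(reference, before), frechet_distance(reference, after))
```

A retention claim would be supported if the score against the reference set after fine-tuning stays close to the score before fine-tuning; how close counts as retained is a judgment the paper would need to state.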

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning results independent of inputs by construction

Full rationale

The paper introduces MMAudioSep via fine-tuning of a pretrained video-to-audio model for sound separation, claiming efficiency gains from transferred knowledge, superior performance versus baselines, and retention of generative capability. No equations, derivations, or first-principles results appear in the abstract or described content that could reduce to fitted parameters or self-definitions by construction. Evaluations rely on external baseline comparisons rather than internal predictions forced by the training setup itself. The transfer and retention claims are presented as empirical outcomes, with no load-bearing self-citations or ansatzes that collapse the argument to prior inputs. This is a standard empirical ML paper whose central claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the evaluation relies on an unspecified pretrained model and unspecified comparison baselines.

pith-pipeline@v0.9.0 · 5689 in / 901 out tokens · 25477 ms · 2026-05-18T08:16:09.907001+00:00 · methodology



Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

  1. [1]

    Recently, sound separation based on a generative approach [5] has been presented, whereas most sound separation models have been implemented as discriminative tasks

    INTRODUCTION Sound separation (SS) models that use conditional information to control model behavior have been widely discussed, including visual-queried separation [1, 2], text-queried separation [3, 4, 5], and omni-modality separation [6]. Recently, sound separation based on a generative approach [5] has been presented, whereas most sound separation mod...

  2. [2]

    MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

    RELATED WORK 2.1. Sound Separation Sound separation reconstructs individual sound sources from mixed signals, crucial for multi-speaker environments, music production, and assistive listening [8]. Recent advances include source extraction using auxiliary information and hybrid methods that combine signal processing with neural networks. Text-Queried So...

  3. [3]

    MMAudio MMAudio [7] is a generative model that uses flow matching to synthesize audio of an input video with optional text conditions

    MMAUDIOSEP 3.1. MMAudio MMAudio [7] is a generative model that uses flow matching to synthesize audio of an input video with optional text conditions. The model operates in a latent space where audio waveforms are encoded via a pretrained variational autoencoder (VAE). In this section, we describe the overview of MMAudio, which serves as the foundation...

  4. [4]

    Experimental Setup Training Dataset. To train MMAudioSep, we utilized the same dataset employed for the pretrained MMAudio, which totals approximately 2,500 hours

    EXPERIMENTS 4.1. Experimental Setup Training Dataset. To train MMAudioSep, we utilized the same dataset employed for the pretrained MMAudio, which totals approximately 2,500 hours. This includes 400 hours for Video+Audio+Label and 2,100 hours for Audio+Text. We train on VGGSound [24] as the video-audio-text dataset. We use the first 8s of each video for ...

  5. [5]

    For short audios (<16s), we truncate them to 8s for training, as in VGGSound

    as the audio-text datasets. For short audios (<16s), we truncate them to 8s for training, as in VGGSound. For longer audios, we take up to five non-overlapping crops of 8s each. This results in a total of 951K audio clip-text pairs. Evaluation Dataset. The AudioSep evaluation dataset is publicly accessible. Since our method also utilizes video queries, our...

  6. [6]

    Each file contains mixture and target sound

    and 1,000 samples from the MUSIC dataset [9], which were both taken from the AudioSep website. Each file contains mixture and target sound. In the VGGSound-Clean dataset, target and noise signals were sampled between -35dB and -25dB LUFS (Loudness Units Full Scale) and mixed. The average signal-to-noise ratio (SNR) of the VGGSound-Clean dataset is aroun...

  7. [7]

    The MMAudioSep model advances in multimodal sound separation by utilizing a fine-tuned pretrained MMAudio model with channel-concatenation conditioning

    CONCLUSION This paper introduces MMAudioSep, an innovative approach to video/text-queried sound separation that leverages a pretrained video-to-audio generation model. The MMAudioSep model advances in multimodal sound separation by utilizing a fine-tuned pretrained MMAudio model with channel-concatenation conditioning. Its primary contributions includ...

  8. [8]

    Clipsep: Learning text-queried sound separation with noisy unlabeled videos,

    Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, and Taylor Berg-Kirkpatrick, “Clipsep: Learning text-queried sound separation with noisy unlabeled videos,” in ICLR, 2023

  9. [9]

    iquery: Instruments as queries for audio-visual sound separation,

    Jiaben Chen, Renrui Zhang, Dongze Lian, Jiaqi Yang, Ziyao Zeng, and Jianbo Shi, “iquery: Instruments as queries for audio-visual sound separation,” in CVPR, 2023

  10. [10]

    Separate what you describe: Language-queried audio source separation,

    Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D Plumbley, and Wenwu Wang, “Separate what you describe: Language-queried audio source separation,” in Interspeech, 2022

  11. [11]

    Separate anything you describe,

    Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D Plumbley, and Wenwu Wang, “Separate anything you describe,” arXiv preprint arXiv:2308.05037, 2023

  12. [12]

    Flowsep: Language-queried sound separation with rectified flow matching,

    Yi Yuan, Xubo Liu, Haohe Liu, Mark D Plumbley, and Wenwu Wang, “Flowsep: Language-queried sound separation with rectified flow matching,” in ICASSP, 2025

  13. [13]

    Omnisep: Unified omni-modality sound separation with query-mixup,

    Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Shengpeng Ji, Jialong Zuo, Tao Jin, and Zhou Zhao, “Omnisep: Unified omni-modality sound separation with query-mixup,” in ICLR, 2025

  14. [14]

    MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis,

    Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji, “MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis,” in CVPR, 2025

  15. [15]

    30+ years of source separation research: Achievements and future challenges,

    Shoko Araki, Nobutaka Ito, Reinhold Haeb-Umbach, Gordon Wichern, Zhong-Qiu Wang, and Yuki Mitsufuji, “30+ years of source separation research: Achievements and future challenges,” in ICASSP, 2025

  16. [16]

    The sound of pixels,

    Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba, “The sound of pixels,” in ECCV, 2018

  17. [17]

    Co-separating sounds of visual objects,

    Ruohan Gao and Kristen Grauman, “Co-separating sounds of visual objects,” in ICCV, 2019

  18. [18]

    The sound of motions,

    Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba, “The sound of motions,” in ICCV, 2019

  19. [19]

    Visually guided sound source separation and localization using self-supervised motion representations,

    Lingyu Zhu and Esa Rahtu, “Visually guided sound source separation and localization using self-supervised motion representations,” arXiv preprint arXiv:2104.08506, 2021

  20. [20]

    Learning transferable visual models from natural language supervision,

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, 2021

  21. [21]

    Scaling rectified flow transformers for high-resolution image synthesis,

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach, “Scaling rectified flow transformers for high-resolution image synthesis,” in ICML, 2024

  22. [22]

    Bigvgan: A universal neural vocoder with large-scale training,

    Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” in ICLR, 2023

  23. [23]

    Read, watch and scream! sound generation from text and video

    Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee, “Read, watch and scream! sound generation from text and video,” arXiv preprint arXiv:2407.05551, 2024

  24. [24]

    Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners,

    Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, and Qifeng Chen, “Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners,” in CVPR, 2024

  25. [25]

    Temporally aligned audio for video with autoregression,

    Ilpo Viertola, Vladimir Iashin, and Esa Rahtu, “Temporally aligned audio for video with autoregression,” in ICASSP, 2025

  26. [26]

    Tell what you hear from what you see–video to audio generation through text,

    Xiulong Liu, Kun Su, and Eli Shlizerman, “Tell what you hear from what you see–video to audio generation through text,” arXiv preprint arXiv:2411.05679, 2024

  27. [27]

    Frieren: Efficient video-to-audio generation network with rectified flow matching,

    Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, and Zhou Zhao, “Frieren: Efficient video-to-audio generation network with rectified flow matching,” in NeurIPS, 2024

  28. [28]

    Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds,

    Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen, “Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds,” arXiv preprint arXiv:2407.01494, 2024

  29. [29]

    V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models,

    Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai, “V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models,” in AAAI, 2024

  30. [30]

    Specmaskfoley: Steering pretrained spectral masked generative transformer toward synchronized video-to-audio synthesis via controlnet,

    Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, and Yuki Mitsufuji, “Specmaskfoley: Steering pretrained spectral masked generative transformer toward synchronized video-to-audio synthesis via controlnet,” arXiv preprint arXiv:2505.16195, 2025

  31. [31]

    Vggsound: A large-scale audio-visual dataset,

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman, “Vggsound: A large-scale audio-visual dataset,” in ICASSP, 2020

  32. [32]

    Audiocaps: Generating captions for audios in the wild,

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL-HLT, 2019

  33. [33]

    Clotho: An audio captioning dataset,

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen, “Clotho: An audio captioning dataset,” in ICASSP, 2020

  34. [34]

    WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,

    Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, and Wenwu Wang, “WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” IEEE/ACM TASLP, 2024

  35. [35]

    Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

    Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” arXiv preprint arXiv:1812.08466, 2018

  36. [36]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP, 2023

  37. [37]

    Efficient training of audio transformers with patchout,

    Khaled Koutini, Jan Schlüter, Hamid Eghbal-Zadeh, and Gerhard Widmer, “Efficient training of audio transformers with patchout,” arXiv preprint arXiv:2110.05069, 2021

  38. [38]

    Cnn architectures for large-scale audio classification,

    Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “Cnn architectures for large-scale audio classification,” in ICASSP, 2017

  39. [39]

    Panns: Large-scale pretrained audio neural networks for audio pattern recognition,

    Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020

  40. [40]

    Improved techniques for training gans,

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training gans,” in NeurIPS, 2016

  41. [41]

    Imagebind: One embedding space to bind them all,

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra, “Imagebind: One embedding space to bind them all,” in CVPR, 2023

  42. [42]

    Synchformer: Efficient synchronization from sparse cues,

    Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman, “Synchformer: Efficient synchronization from sparse cues,” in ICASSP, 2024
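
The clip-preparation rule quoted in extracted entry 5 above (audio shorter than 16s is truncated to 8s; longer audio yields up to five non-overlapping 8s crops) can be sketched as a small helper. The source states the thresholds but not where the crops are placed, so taking them back to back from the start of the file is an assumption, as are the handling of clips shorter than 8s and the function name `plan_crops`.

```python
def plan_crops(duration_s, crop_s=8.0, short_limit_s=16.0, max_crops=5):
    """Return (start, end) times in seconds for the training crops of one audio file."""
    if duration_s < short_limit_s:
        # Short audio: keep a single crop, truncated to crop_s.
        # Assumption: files shorter than crop_s are kept at their full length.
        return [(0.0, min(duration_s, crop_s))]
    # Long audio: as many whole crops as fit, capped at max_crops.
    # Assumption: crops are consecutive and start at time zero.
    count = min(max_crops, int(duration_s // crop_s))
    return [(i * crop_s, (i + 1) * crop_s) for i in range(count)]


for duration in (10.0, 20.0, 100.0):
    print(duration, plan_crops(duration))
```

Under these assumptions a 20s file contributes two crops and any file of 40s or more contributes the full five.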