pith. machine review for the scientific record.

arxiv: 2503.10522 · v4 · submitted 2025-03-13 · 💻 cs.MM · cs.CV · cs.LG · cs.SD · eess.AS

Recognition: unknown

AudioX: A Unified Framework for Anything-to-Audio Generation

Authors on Pith: no claims yet
classification 💻 cs.MM · cs.CV · cs.LG · cs.SD · eess.AS
keywords generation · multimodal · audio · audiox · framework · unified · signals · anything-to-audio
0 comments
read the original abstract

Audio and music generation from flexible multimodal control signals has broad applications but faces two key challenges: 1) the lack of a unified multimodal modeling framework, and 2) the scarcity of large-scale, high-quality training data. To address these challenges, we propose AudioX, a unified framework for anything-to-audio generation that integrates diverse multimodal conditions (i.e., text, video, and audio signals). The core design of this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks and find that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method can generate audio under multimodal control signals and shows strong instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.
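The abstract does not specify the internals of the Multimodal Adaptive Fusion module. Purely as an illustrative sketch of the general idea of fusing optional per-modality condition embeddings into a single conditioning signal (the function name, the attention-pooling design, and the mean-pooled query are assumptions, not the paper's method), such a component might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(text_emb, video_emb, audio_emb):
    """Hypothetical fusion of variable-length modality embeddings.

    Each argument is an array of shape (seq_len, d), or None when that
    modality is absent; the output is a single pooled condition vector.
    """
    tokens = [e for e in (text_emb, video_emb, audio_emb) if e is not None]
    cond = np.concatenate(tokens, axis=0)        # (total_len, d)
    d = cond.shape[1]
    # A trained model would use a learned query; a mean-pooled one
    # stands in here so the sketch stays self-contained.
    query = cond.mean(axis=0, keepdims=True)     # (1, d)
    attn = softmax(query @ cond.T / np.sqrt(d))  # (1, total_len)
    return attn @ cond                           # (1, d)
```

Because absent modalities are simply skipped before concatenation, the same code path handles text-only, video-only, or fully multimodal conditioning, which is one plausible way a single model could serve "anything-to-audio" tasks.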

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    cs.MM 2026-04 unverdicted novelty 7.0

    Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

  2. VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

    cs.SD 2026-04 unverdicted novelty 7.0

    VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

  3. ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

    cs.MM 2026-04 unverdicted novelty 6.0

    ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.