AudioX: A Unified Framework for Anything-to-Audio Generation

2025 · cs.MM · arXiv 2503.10522

7 Pith papers cite this work. Polarity classification is still indexing.



abstract

Audio and music generation based on flexible multimodal control signals is a widely applicable topic with two key challenges: 1) the need for a unified multimodal modeling framework, and 2) the need for large-scale, high-quality training data. To address these challenges, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). The core design of this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks, finding that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method is capable of audio generation under multimodal control signals, showing powerful instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

cs.MM · 2026-04-16 · unverdicted · novelty 7.0

Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

cs.SD · 2026-04-12 · unverdicted · novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

cs.CV · 2025-11-29 · conditional · novelty 7.0

MVAD is the first comprehensive benchmark dataset for AI-generated multimodal video-audio detection, with three realistic forgery patterns, high-quality outputs from state-of-the-art models, and diversity across visual styles and content categories.

StereoFoley: Object-Aware Stereo Audio Generation from Video

cs.SD · 2025-09-22 · conditional · novelty 7.0

StereoFoley is an end-to-end video-to-stereo-audio framework that uses a base generative model fine-tuned on synthetic object-tracked data with panning and distance controls to achieve object-aware spatial sound.

WavFlow: Audio Generation in Waveform Space

cs.SD · 2026-05-18 · conditional · novelty 6.0

WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

cs.MM · 2026-04-16 · unverdicted · novelty 6.0

ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.

citing papers explorer

Showing 7 of 7 citing papers.

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
cs.MM · 2026-04-16 · unverdicted · none · ref 62 · internal anchor

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD · 2026-04-12 · unverdicted · none · ref 48 · internal anchor

Omni2Sound: Towards Unified Video-Text-to-Audio Generation
cs.SD · 2026-01-06 · unverdicted · none · ref 14 · internal anchor

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection
cs.CV · 2025-11-29 · conditional · none · ref 44 · internal anchor

StereoFoley: Object-Aware Stereo Audio Generation from Video
cs.SD · 2025-09-22 · conditional · none · ref 14 · internal anchor

WavFlow: Audio Generation in Waveform Space
cs.SD · 2026-05-18 · conditional · none · ref 19 · internal anchor

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
cs.MM · 2026-04-16 · unverdicted · none · ref 40 · internal anchor
