SemanticAudio: Audio Generation and Editing in Semantic Space

· 2026 · eess.AS · arXiv 2601.21402

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: https://semanticaudio1.github.io/

representative citing papers

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

cs.SD · 2026-06-18 · unverdicted · novelty 6.0 · 2 refs

Hybrid two-stage diffusion transformer architecture for instruction-guided audio editing via rectified flow that performs joint attention at low resolution then alternates joint and cross-attention at high resolution for improved performance and efficiency.

citing papers explorer

Showing 1 of 1 citing paper.

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow cs.SD · 2026-06-18 · unverdicted · none · ref 45 · 2 links · internal anchor
Hybrid two-stage diffusion transformer architecture for instruction-guided audio editing via rectified flow that performs joint attention at low resolution then alternates joint and cross-attention at high resolution for improved performance and efficiency.

fields

years

verdicts

representative citing papers

citing papers explorer