Fr\'echet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms
read the original abstract
We propose the Fr\'echet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions. As an alternative, we propose adapting the Fr\'echet Inception Distance (FID) metric used to evaluate generative image models to the audio domain. FAD is validated using a wide variety of artificial distortions and is compared to the signal based metrics signal to distortion ratio (SDR), cosine distance and magnitude L2 distance. We show that, with a correlation coefficient of 0.52, FAD correlates more closely with human perception than either SDR, cosine distance or magnitude L2 distance, with correlation coefficients of 0.39, -0.15 and -0.01 respectively.
This paper has not been read by Pith yet.
Forward citations
Cited by 11 Pith papers
-
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after buildin...
-
Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems
MixtureTT performs direct per-stem timbre transfer on polyphonic mixtures via a shared diffusion transformer, outperforming single-stem baselines on SATB choral data while eliminating cascaded separation errors.
-
Latent Fourier Transform
LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
-
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.
-
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.
-
Stage-adaptive audio diffusion modeling
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
-
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
-
Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods
The paper introduces the ATTM Grand Challenge with a CC-licensed instrumental subset of MTG-Jamendo, two tracks, and evaluation via FAD, CLAP, and a new Concept Coverage Score to support academic text-to-music research.
-
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
MMAudioSep adapts a pretrained video-to-audio model via fine-tuning for video/text-queried sound separation, outperforming baselines while preserving generation ability.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.