Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

Dominik Roblek; Kevin Kilgour; Matthew Sharifi; Mauricio Zuluaga

arxiv: 1812.08466 · v4 · pith:XMDYSOB4new · submitted 2018-12-20 · eess.AS · cs.SD

keywords: distance, audio, Fréchet, enhancement, metric, algorithms, correlation, cosine

We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions. As an alternative, we propose adapting the Fréchet Inception Distance (FID) metric used to evaluate generative image models to the audio domain. FAD is validated using a wide variety of artificial distortions and is compared to the signal-based metrics signal-to-distortion ratio (SDR), cosine distance, and magnitude L2 distance. We show that, with a correlation coefficient of 0.52, FAD correlates more closely with human perception than SDR, cosine distance, or magnitude L2 distance, which have correlation coefficients of 0.39, -0.15, and -0.01, respectively.
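The abstract does not spell out the formula, but FID-style metrics are generally computed as the closed-form Fréchet distance between two multivariate Gaussians fitted to embedding sets: the squared distance between the means plus Tr(Σ_r + Σ_e − 2(Σ_r Σ_e)^(1/2)). The sketch below illustrates that computation only. The function name `frechet_distance` and the random arrays standing in for audio embeddings are assumptions made for illustration; the embedding model that would produce real inputs is not shown, and this is not the authors' implementation.

```python
import numpy as np


def frechet_distance(emb_ref: np.ndarray, emb_eval: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (n_samples, n_dims) with n_dims >= 2.
    """
    mu_r = emb_ref.mean(axis=0)
    mu_e = emb_eval.mean(axis=0)
    sigma_r = np.cov(emb_ref, rowvar=False)
    sigma_e = np.cov(emb_eval, rowvar=False)

    # Tr(sqrt(sigma_r @ sigma_e)) equals the sum of the square roots of the
    # eigenvalues of the product. They are real and non-negative in theory;
    # clipping guards against small negative values from round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_e)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_e) - 2.0 * tr_sqrt)


# Hypothetical stand-ins for embeddings of reference and evaluated audio.
rng = np.random.default_rng(0)
clean = rng.normal(size=(2000, 8))
shifted = clean + 1.0  # same covariance, mean moved by 1 in each of 8 dims

print(frechet_distance(clean, clean))
print(frechet_distance(clean, shifted))
```

Identical sets give a distance of essentially zero, while shifting every dimension by a constant leaves the covariance unchanged, so only the mean term contributes. Summing square roots of eigenvalues avoids an explicit matrix square root; a library routine such as `scipy.linalg.sqrtm` would be a reasonable alternative.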

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
cs.CV 2026-05 unverdicted novelty 7.0

InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after buildin...

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems
cs.SD 2026-05 unverdicted novelty 7.0

MixtureTT performs direct per-stem timbre transfer on polyphonic mixtures via a shared diffusion transformer, outperforming single-stem baselines on SATB choral data while eliminating cascaded separation errors.

Latent Fourier Transform
cs.SD 2026-04 unverdicted novelty 7.0

LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.

FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
cs.CV 2026-04 unverdicted novelty 7.0

FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
cs.SD 2026-04 unverdicted novelty 7.0

OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.

Stage-adaptive audio diffusion modeling
cs.SD 2026-05 unverdicted novelty 6.0

A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
cs.SD 2025-02 unverdicted novelty 6.0

Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods
cs.SD 2026-05 accept novelty 5.0

The paper introduces the ATTM Grand Challenge with a CC-licensed instrumental subset of MTG-Jamendo, two tracks, and evaluation via FAD, CLAP, and a new Concept Coverage Score to support academic text-to-music research.

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
cs.SD 2026-05 unverdicted novelty 5.0

A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...

Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
cs.SD 2025-10 unverdicted novelty 4.0

MMAudioSep adapts a pretrained video-to-audio model via fine-tuning for video/text-queried sound separation, outperforming baselines while preserving generation ability.