pith. machine review for the scientific record.

arxiv: 1812.08466 · v4 · submitted 2018-12-20 · 📡 eess.AS · cs.SD

Recognition: unknown

Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

keywords: distance, audio, Fréchet, enhancement, metric, algorithms, correlation, cosine

We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions. As an alternative, we propose adapting the Fréchet Inception Distance (FID) metric used to evaluate generative image models to the audio domain. FAD is validated using a wide variety of artificial distortions and is compared to the signal-based metrics signal-to-distortion ratio (SDR), cosine distance, and magnitude L2 distance. We show that, with a correlation coefficient of 0.52, FAD correlates more closely with human perception than SDR, cosine distance, or magnitude L2 distance, which have correlation coefficients of 0.39, -0.15, and -0.01, respectively.
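The abstract does not spell out the computation, so the following is only a minimal sketch of the standard Fréchet distance between two multivariate Gaussians, the quantity that FID-style metrics are built on: ||mu_a - mu_b||^2 + tr(cov_a + cov_b - 2 (cov_a cov_b)^(1/2)). The function names `frechet_distance` and `embedding_stats` are hypothetical, and the random arrays merely stand in for audio embeddings; the embedding model used by the paper is not named in the abstract and is not reproduced here.

```python
import numpy as np


def frechet_distance(mu_a, cov_a, mu_b, cov_b):
    """Squared Frechet distance between N(mu_a, cov_a) and N(mu_b, cov_b)."""
    diff = mu_a - mu_b
    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def embedding_stats(embeddings):
    """Mean and covariance of an (n_samples, n_dims) embedding matrix."""
    return embeddings.mean(axis=0), np.cov(embeddings, rowvar=False)


# Placeholder data (assumption): random vectors standing in for embeddings
# of a reference set and of a uniformly shifted copy of that set.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
shifted = reference + 1.0

d_same = frechet_distance(*embedding_stats(reference), *embedding_stats(reference))
d_shift = frechet_distance(*embedding_stats(reference), *embedding_stats(shifted))
```

Comparing a set with itself gives a distance of approximately zero, while shifting every one of the 8 dimensions by 1.0 leaves the covariance unchanged and contributes only the squared mean difference, so `d_shift` is approximately 8. Lower values indicate that the two embedding distributions are closer.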

This paper has not been read by Pith yet.


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

    cs.SD 2026-05 unverdicted novelty 7.0

    MixtureTT performs direct per-stem timbre transfer on polyphonic mixtures via a shared diffusion transformer, outperforming single-stem baselines on SATB choral data while eliminating cascaded separation errors.

  2. Latent Fourier Transform

    cs.SD 2026-04 unverdicted novelty 7.0

    LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.

  3. FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

    cs.CV 2026-04 unverdicted novelty 7.0

    FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.

  4. OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

    cs.SD 2026-04 unverdicted novelty 7.0

    OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.

  5. Stage-adaptive audio diffusion modeling

    cs.SD 2026-05 unverdicted novelty 6.0

    A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.

  6. Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

    cs.SD 2026-05 unverdicted novelty 5.0

    A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...

  7. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.