An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance
Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
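To make the encoder dependence concrete, here is a minimal sketch of the standard Fréchet distance between Gaussians fitted to two sets of embeddings, which is the usual definition behind FAD. It is not taken from the paper: the function name `frechet_distance` and the array names are illustrative assumptions, and the paper's log-scale normalization is not reproduced because the abstract does not specify it.

```python
import numpy as np

def frechet_distance(emb_ref, emb_gen):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_gen: arrays of shape (n_samples, dim), assumed to come
    from the same (hypothetical) audio encoder.
    """
    # Fit a Gaussian (mean and covariance) to each embedding set.
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    # tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs. Clip to guard against round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

# Toy usage with synthetic "embeddings" (not real audio features).
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 4))
same = frechet_distance(ref, ref)           # identical sets
shifted = frechet_distance(ref, ref + 3.0)  # mean shifted by 3 in each of 4 dims
```

Only the embeddings enter the computation, so any acoustic property the encoder discards cannot affect the score. That is the mechanism by which an encoder's training task biases FAD.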
Forward citations
Cited by 2 Pith papers
- VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
  VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
- Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
  OTAD replaces FAD's frozen embedding cost and Gaussian coupling with a learned Riemannian adapter and Sinkhorn OT, reporting higher MOS Spearman correlation on DCASE 2023 and per-sample diagnostics with AUROC at least 0.86.