An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

Wonwoo Jeong

arxiv: 2602.23958 · v2 · pith:RTQMQ6SAnew · submitted 2026-02-27 · 📡 eess.AS · cs.SD

keywords: encoder, audio, detection, distance, Fréchet, encoders, precision, semantic

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
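
As a rough illustration of why the choice of encoder matters, the sketch below computes the standard Fréchet distance between Gaussians fitted to two embedding sets, then scores the same pair of synthetic feature sets through two toy "encoders". Everything here is an assumption made for illustration: the encoders are plain coordinate selections (not any of the six models evaluated in the paper), the data are random draws, and the paper's log-scale normalization and Recall/Precision/Alignment decomposition are not reproduced.

```python
import numpy as np


def frechet_distance(emb_ref, emb_gen):
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input is an array of shape (n_clips, dim), with dim >= 2.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) is the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Synthetic 4-dimensional "acoustic features" (hypothetical, not real audio).
rng = np.random.default_rng(0)
ref = rng.normal(size=(2000, 4))
gen = rng.normal(size=(2000, 4))
gen[:, 3] += 2.0  # the generated set is degraded only along feature 3


# Two toy encoders: one discards the degraded feature, one preserves it.
def encoder_discards(x):
    return x[:, :2]


def encoder_preserves(x):
    return x[:, 2:]


fad_discard = frechet_distance(encoder_discards(ref), encoder_discards(gen))
fad_preserve = frechet_distance(encoder_preserves(ref), encoder_preserves(gen))
print(f"encoder that discards the feature:  {fad_discard:.3f}")
print(f"encoder that preserves the feature: {fad_preserve:.3f}")
```

The same pair of sets receives a near-zero score from the encoder that discards the degraded feature and a much larger score from the one that preserves it, which is the kind of encoder dependence the abstract describes.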

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD · 2026-04 · unverdicted · novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
eess.AS · 2026-05 · unverdicted · novelty 6.0

OTAD replaces FAD's frozen embedding cost and Gaussian coupling with a learned Riemannian adapter and Sinkhorn OT, reporting higher MOS Spearman correlation on DCASE 2023 and per-sample diagnostics with AUROC at least 0.86.