Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.
Probing Spatial Structure in Pretrained Audio Representations
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.
fields
cs.SD 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Probing Spatial Structure in Pretrained Audio Representations
Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.