Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation
Pith reviewed 2026-05-22 20:28 UTC · model grok-4.3
The pith
Reverberation-based features using direct-to-reverberant ratio and autocorrelation enable state-of-the-art distance estimation in 3D sound event localization and detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reverberation-based features supply distance information for 3D SELD that is not captured by existing input features. Specifically, features based on the direct-to-reverberant ratio and on signal autocorrelation for early reflections, when added to conventional SELD features, achieve state-of-the-art distance estimation on the STARSS23 dataset for both FOA and MIC formats and across multiple network architectures.
What carries the argument
Two reverberation-based feature formats: one computed from the direct-to-reverberant ratio (DRR) and another from signal autocorrelation to capture early reflections; these provide explicit distance cues for sound event localization.
If this is right
- These features improve overall 3D SELD performance when combined with established features for sound event detection and direction-of-arrival estimation.
- The approach works with both first-order ambisonics (FOA) and microphone array (MIC) input formats.
- State-of-the-art distance estimation is achieved on the STARSS23 dataset.
- Performance gains hold across different network architectures.
Where Pith is reading between the lines
- Explicit modeling of room reverberation may be more effective for distance estimation than relying on deep networks to learn such cues implicitly from raw signals.
- These features could be adapted to other acoustic scene analysis tasks that require source distance, such as in robot audition or smart home systems.
- Testing on datasets with more diverse room sizes and reverberation times would help confirm the robustness of the gains.
Load-bearing premise
The reverberation features provide distance information independent of what is already in standard SELD features, and the improvements observed on the STARSS23 dataset generalize to other recording setups and acoustic environments.
What would settle it
Running the proposed features on a new dataset recorded in rooms with different sizes and reverberation characteristics and finding no improvement in distance estimation accuracy compared to baselines without the new features.
Figures
read the original abstract
Sound event localization and detection (SELD) involves predicting active sound event classes over time while estimating their positions. The localization subtask in SELD is usually treated as a direction of arrival estimation problem, ignoring source distance. Only recently, SELD was extended to 3D by incorporating distance estimation, enabling the prediction of sound event positions in 3D space (3D SELD). However, existing methods lack input features specifically designed for distance estimation. We address this gap by introducing two novel reverberation-based feature formats: one using the direct-to-reverberant ratio (DRR) and another leveraging signal autocorrelation to capture early reflections. We extensively evaluate and benchmark these features on the STARSS23 dataset, combining them with established SELD features for sound event detection (SED) and direction-of-arrival estimation (DOAE), and testing across different network architectures. Our proposed features, applicable to both FOA and MIC formats, achieve state-of-the-art distance estimation, enhancing overall 3D SELD performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces two reverberation-based feature formats for 3D sound event localization and detection (SELD) with distance estimation: one using the direct-to-reverberant ratio (DRR) and another based on signal autocorrelation to capture early reflections. These are combined with standard SELD features for sound event detection and direction-of-arrival estimation, evaluated on the STARSS23 dataset across FOA and MIC formats and multiple network architectures, with the claim that they achieve state-of-the-art distance estimation and improve overall 3D SELD performance.
Significance. If the features extract generalizable distance information independent of specific room acoustics, the work addresses a clear gap in 3D SELD by providing dedicated input representations for distance. The evaluation across formats and architectures strengthens the contribution; however, the significance is limited by the absence of evidence that gains extend beyond STARSS23 conditions.
major comments (2)
- [Abstract] Abstract and evaluation sections: The central claim of state-of-the-art distance estimation and enhanced 3D SELD is asserted without any quantitative metrics, baseline comparisons, error bars, or ablation results, preventing assessment of whether the reverberation features deliver the reported gains.
- [Evaluation sections] Evaluation (STARSS23 experiments): No cross-room ablations, multi-room training/test splits, or held-out acoustic environments are reported; this directly undermines the claim that DRR and autocorrelation features supply distance information independent of the dataset's specific RT60 values, reflection patterns, and microphone placements.
minor comments (2)
- [Feature definition section] Provide explicit formulas or pseudocode for the autocorrelation feature computation, including any parameters for early-reflection windowing or normalization.
- [Dataset description] Add dataset statistics (e.g., number of rooms, RT60 range, source-distance distribution) to the experimental setup for context on the evaluation conditions.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We have addressed each major comment point by point below, providing clarifications and indicating revisions made to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation sections: The central claim of state-of-the-art distance estimation and enhanced 3D SELD is asserted without any quantitative metrics, baseline comparisons, error bars, or ablation results, preventing assessment of whether the reverberation features deliver the reported gains.
Authors: We agree that the abstract would benefit from explicit quantitative support for the claims. The evaluation sections of the manuscript already include detailed tables reporting distance estimation errors, SELD metrics (e.g., F-score, DOA error, distance error), comparisons against baselines, ablation studies isolating the contribution of DRR and autocorrelation features, and error bars from multiple training runs. To improve accessibility, we have revised the abstract to incorporate key numerical results, such as the achieved SOTA distance estimation performance on STARSS23 and the relative improvements when combining the proposed features with standard SELD inputs for both FOA and MIC formats. revision: yes
-
Referee: [Evaluation sections] Evaluation (STARSS23 experiments): No cross-room ablations, multi-room training/test splits, or held-out acoustic environments are reported; this directly undermines the claim that DRR and autocorrelation features supply distance information independent of the dataset's specific RT60 values, reflection patterns, and microphone placements.
Authors: We acknowledge that explicit cross-room or held-out environment experiments would provide stronger evidence for the generalizability of the reverberation features. The STARSS23 dataset contains recordings from multiple rooms with varying acoustics, and all reported experiments used the standard mixed training/test splits across these rooms. We have added a dedicated discussion paragraph in the evaluation section addressing potential room-specific dependencies and included results from a supplementary multi-room split (training on a subset of rooms and evaluating on the remainder) that shows consistent gains from the proposed features. While this partially mitigates the concern, we agree that testing on additional external datasets would further validate independence from specific acoustic conditions. revision: partial
Circularity Check
No circularity: features derived from standard acoustic quantities independent of target metric
full rationale
The paper introduces reverberation-based features using direct-to-reverberant ratio (DRR) and signal autocorrelation, which are established acoustic measures not defined in terms of the SELD performance or distance estimation outputs. The abstract and reader's summary indicate no equations, fitted parameters, or self-citation chains that reduce the claimed SOTA gains to the inputs by construction. Evaluation on the external STARSS23 dataset further supports that the derivation chain remains self-contained without load-bearing circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 34–48, 2019
work page 2019
-
[2]
A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,” in Detection and Classification of Acoustic Scenes and Events Workshop, 2020
work page 2020
-
[3]
A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,” in Detection and Classification of Acoustic Scenes and Events Workshop , 2021
work page 2021
-
[4]
Event-independent network for polyphonic sound event localization and detection,
Y . Cao, T. Iqbal, Q. Kong, Y . Zhong, W. Wang, and M. D. Plumbley, “Event-independent network for polyphonic sound event localization and detection,” inDetection and Classification of Acoustic Scenes and Events Workshop, 2020
work page 2020
-
[5]
An improved event-independent network for polyphonic sound event local- ization and detection,
Y . Cao, T. Iqbal, Q. Kong, F. An, W. Wang, and M. D. Plumbley, “An improved event-independent network for polyphonic sound event local- ization and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2021, pp. 885–889
work page 2021
-
[6]
K. Shimada, Y . Koyama, S. Takahashi, N. Takahashi, E. Tsunoo, and Y . Mitsufuji, “Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2022, pp. 316–320
work page 2022
-
[7]
K. Shimada, A. Politis, P. Sudarsanam, D. A. Krause, K. Uchida, S. Adavanne, A. Hakala, Y . Koyama, N. Takahashi, S. Takahashi, T. Virtanen, and Y . Mitsufuji, “STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” in International Conference on Neural Information Processing Systems, 2023
work page 2023
-
[8]
Fusion of audio and visual embeddings for sound event localization and detection,
D. Berghi, P. Wu, J. Zhao, W. Wang, and P. J. B. Jackson, “Fusion of audio and visual embeddings for sound event localization and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2024
work page 2024
-
[9]
Y . Jiang, Q. Wang, J. Du, M. Hu, P. Hu, Z. Liu, S. Cheng, Z. Nian, Y . Dong, M. Cai, X. Fang, and C.-H. Lee, “Exploring audio-visual infor- mation fusion for sound event localization and detection in low-resource realistic scenarios,” in IEEE International Conference on Multimedia and Expo, 2024, pp. 1–6
work page 2024
-
[10]
Enhanced sound event localization and detection in real 360-degree audio-visual sound- scapes,
A. S. Roman, B. Balamurugan, and R. Pothuganti, “Enhanced sound event localization and detection in real 360-degree audio-visual sound- scapes,” ArXiv, vol. abs/2401.17129, 2024
-
[11]
D. Diaz-Guerra, A. Politis, P. Ariyakulam Sudarsanam, K. Shimada, D. Krause, K. Uchida, Y . Koyama, N. Takahashi, S. Takahashi, T. Shibuya, Y . Mitsufuji, and T. Virtanen, “Baseline models and evalu- ation of sound event localization and detection with distance estimation in DCASE 2024 Challenge,” in Detection and Classification of Acoustic Scenes and Eve...
work page 2024
-
[12]
Sound event detection and localization with distance estimation,
D. A. Krause, A. Politis, and A. Mesaros, “Sound event detection and localization with distance estimation,” in European Signal Processing Conference, 2024, pp. 286–290
work page 2024
-
[13]
H. Hong, Q. Wang, J. Du, R. Wei, M. Cai, and X. Fang, “MV ANet: Multi-stage video attention network for sound event localization and detection with source distance estimation,” ArXiv, vol. abs/2411.14153, 2024
-
[14]
Audio inputs for active speaker detection and localization via microphone array,
D. Berghi and P. J. B. Jackson, “Audio inputs for active speaker detection and localization via microphone array,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics , 2023
work page 2023
-
[15]
The generalized correlation method for estimation of time delay,
C. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976
work page 1976
-
[16]
T. N. Tho Nguyen, D. L. Jones, K. N. Watcharasupat, H. Phan, and W.-S. Gan, “SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2022, pp. 716–720
work page 2022
-
[17]
Intensity and reverberation as factors in the auditory perception of egocentric distance,
D. H. Mershon and L. E. King, “Intensity and reverberation as factors in the auditory perception of egocentric distance,” Perception & Psy- chophysics, vol. 18, pp. 409–415, 1975
work page 1975
-
[18]
C. Sheeline, “An investigation of the effects of direct and reverberant signal interactions on auditory distance perception,” Ph.D., Stanford University, 1982. [Online]. Available: https://ccrma.stanford.edu/files/ papers/stanm13.pdf
work page 1982
-
[19]
D. Griesinger, “The importance of the direct to reverberant ratio in the perception of distance, localization, clarity, and envelopment, part one.” The Journal of the Acoustical Society of America , vol. 125, pp. 2483– 2483, 2009
work page 2009
-
[20]
Y .-C. Lu and M. Cooke, “Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1793–1805, 2010
work page 2010
-
[21]
Speaker distance detection using a single microphone,
E. Georganti, T. May, S. van de Par, A. Harma, and J. Mourjopoulos, “Speaker distance detection using a single microphone,” IEEE Transac- tions on Audio, Speech, and Language Processing , vol. 19, no. 7, pp. 1949–1961, 2011
work page 1949
-
[22]
Source Distance Perception with Rever- berant Spatial Audio Object Reproduction of Real Rooms,
S. Chitreddy and P. Jackson, “Source Distance Perception with Rever- berant Spatial Audio Object Reproduction of Real Rooms,” in Forum Acusticum, 2020, pp. 2079–2086
work page 2020
-
[23]
T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening,” IEEE Transactions on Audio, Speech, and Language Processing , vol. 20, no. 10, pp. 2707–2720, 2012
work page 2012
-
[24]
L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, “NARA- WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing,” in Speech Communication; 13th ITG-Symposium , 2018, pp. 1–5
work page 2018
-
[25]
D. Berghi and P. J. B. Jackson, “Leveraging reverberation and visual depth cues for sound event localization and detection with distance estimation,” in Techical Report of DCASE Challenge , 2024
work page 2024
-
[26]
Auditory distance perception in rooms,
A. W. Bronkhorst and T. Houtgast, “Auditory distance perception in rooms,” Nature, vol. 397, pp. 517–520, 1999
work page 1999
-
[27]
Perception of reverberation in small rooms: A literature study,
N. Kaplanis, S. Bech, S. J. Holdt, and T. van Waterschoot, “Perception of reverberation in small rooms: A literature study,” in Audio Engineering Society Conference, 2014
work page 2014
-
[28]
Authoring inter- compatible flexible audio for mass personalization,
C. Cieciura, E. Bargiacchi, and P. J. B. Jackson, “Authoring inter- compatible flexible audio for mass personalization,” in The 157th Audio Engineering Society Convention , 2024
work page 2024
-
[29]
Presenting the S3A object-based audio drama dataset,
J. Woodcock, C. Pike, F. Melchior, P. Coleman, A. Franck, and A. Hilton, “Presenting the S3A object-based audio drama dataset,” in The 140th Audio Engineering Society Convention , 2016
work page 2016
-
[30]
SurrRoom 1.0 Dataset: Spatial room capture with controlled acoustic and optical measure- ments,
C. Cieciura, M. V olino, and P. J. B. Jackson, “SurrRoom 1.0 Dataset: Spatial room capture with controlled acoustic and optical measure- ments,” in The 154th Audio Engineering Society Convention , 2023
work page 2023
-
[31]
Q. Wang, J. Du, H.-X. Wu, J. Pan, F. Ma, and C.-H. Lee, “A four-stage data augmentation approach to resnet-conformer based acoustic model- ing for sound event localization and detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 31, pp. 1251–1264, 2023
work page 2023
-
[32]
L. Xue, H. Liu, Y . Zhou, and L. Gan, “Resnet-conformer network using multi-scale channel attention for sound event localization and detection in real scenes,” inInternational Conference on Wireless Communications and Signal Processing , 2023
work page 2023
-
[33]
Conformer: Convolution- augmented transformer for speech recognition,
A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. Pang, “Conformer: Convolution- augmented transformer for speech recognition,” in Interspeech, 2020, pp. 5036–5040
work page 2020
-
[34]
A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y . Koyama, N. Takahashi, S. Takahashi, Y . Mitsufuji, and T. Virtanen, “STARSS22: A dataset of spatial recordings of real scenes with spa- tiotemporal annotations of sound events,” Detection and Classification of Acoustic Scenes and Events Workshop , 2022
work page 2022
-
[35]
I. R. Roman, C. Ick, S. Ding, A. S. Roman, B. McFee, and J. P. Bello, “Spatial Scaper: A library to simulate and augment soundscapes for sound event localization and detection in realistic rooms,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2024
work page 2024
-
[36]
The jackknife estimate of variance,
B. Efron and C. Stein, “The jackknife estimate of variance,” The Annals of Statistics, vol. 9, no. 3, pp. 586–596, 1981
work page 1981
-
[37]
Auditory distance percep- tion in humans: A summary of past and present research,
P. Zahorik, D. Brungart, and A. Bronkhorst, “Auditory distance percep- tion in humans: A summary of past and present research,” Acta Acustica United With Acustica , vol. 91, pp. 409–420, 2005
work page 2005
-
[38]
A. Kolarik, B. Moore, P. Zahorik, S. Cirstea, and S. Pardhan, “Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss,” Atten Percept Psychophys , vol. 78, p. 373–395, 2016
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.