arXiv: 2504.08644 · v2 · submitted 2025-04-11 · eess.AS · cs.SD · eess.SP

Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation

Pith reviewed 2026-05-22 20:28 UTC · model grok-4.3

classification: eess.AS · cs.SD · eess.SP
keywords: sound event localization and detection · 3D SELD · distance estimation · reverberation · direct-to-reverberant ratio · autocorrelation · FOA · microphone array

The pith

Reverberation-based features using direct-to-reverberant ratio and autocorrelation enable state-of-the-art distance estimation in 3D sound event localization and detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two new feature formats derived from reverberation properties to support distance estimation as part of 3D sound event localization and detection. One feature uses the direct-to-reverberant ratio while the other uses autocorrelation to capture early reflections. These are combined with standard features for detecting sound classes and estimating directions of arrival. Evaluations on the STARSS23 dataset show improved distance prediction across different input formats and network architectures, leading to better overall 3D SELD results.

Core claim

Reverberation-based features supply distance information for 3D SELD that is not captured by existing input features. Specifically, features based on the direct-to-reverberant ratio and on signal autocorrelation for early reflections, when added to conventional SELD features, achieve state-of-the-art distance estimation on the STARSS23 dataset for both FOA and MIC formats and across multiple network architectures.

What carries the argument

Two reverberation-based feature formats: one computed from the direct-to-reverberant ratio (DRR) and another from signal autocorrelation to capture early reflections; these provide explicit distance cues for sound event localization.
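
To make the first of these cues concrete, here is a minimal sketch of the textbook DRR definition applied to a known room impulse response (RIR). It is not the authors' extraction method, which works on recorded signals and is not described on this page. The function names `drr_db` and `toy_rir`, the 2.5 ms direct-path window, and the toy room parameters are all assumptions chosen for illustration.

```python
import numpy as np

def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio of a room impulse response, in dB."""
    rir = np.asarray(rir, dtype=float)
    # Assumption: the direct path is the strongest peak of the RIR.
    n_direct = int(np.argmax(np.abs(rir)))
    half_win = int(round(direct_window_ms * 1e-3 * fs))
    start = max(n_direct - half_win, 0)
    stop = min(n_direct + half_win + 1, len(rir))
    direct_energy = np.sum(rir[start:stop] ** 2)
    reverb_energy = np.sum(rir[stop:] ** 2)
    eps = 1e-12  # avoids log of zero for degenerate inputs
    return 10.0 * np.log10((direct_energy + eps) / (reverb_energy + eps))

def toy_rir(distance_m, fs=24000, rt60_s=0.5, length_s=1.0, seed=0):
    """Hypothetical RIR: 1/r direct impulse plus a distance-independent noise tail."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    t = np.arange(n) / fs
    delay = int(round(distance_m / 343.0 * fs))  # propagation delay in samples
    rir = np.zeros(n)
    rir[delay] = 1.0 / distance_m
    # Exponentially decaying tail that reaches -60 dB at t = rt60_s.
    tail = 0.01 * rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60_s)
    tail_start = delay + int(0.005 * fs)
    rir[tail_start:] += tail[: n - tail_start]
    return rir

near = drr_db(toy_rir(1.0), fs=24000)
far = drr_db(toy_rir(4.0), fs=24000)
print(f"DRR at 1 m: {near:.1f} dB, at 4 m: {far:.1f} dB")
```

In this toy model the reverberant energy does not depend on source position, so the DRR falls by about 20·log10(4) dB when the distance goes from 1 m to 4 m. That monotonic relationship is what makes DRR a candidate distance cue. Real rooms will depart from it.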

If this is right

  • These features improve overall 3D SELD performance when combined with established features for sound event detection and direction-of-arrival estimation.
  • The approach works with both first-order ambisonics (FOA) and microphone array (MIC) input formats.
  • State-of-the-art distance estimation is achieved on the STARSS23 dataset.
  • Performance gains hold across different network architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit modeling of room reverberation may be more effective for distance estimation than relying on deep networks to learn such cues implicitly from raw signals.
  • These features could be adapted to other acoustic scene analysis tasks that require source distance, such as in robot audition or smart home systems.
  • Testing on datasets with more diverse room sizes and reverberation times would help confirm the robustness of the gains.

Load-bearing premise

The reverberation features provide distance information independent of what is already in standard SELD features, and the improvements observed on the STARSS23 dataset generalize to other recording setups and acoustic environments.

What would settle it

Running the proposed features on a new dataset recorded in rooms with different sizes and reverberation characteristics and finding no improvement in distance estimation accuracy compared to baselines without the new features.

Figures

Figures reproduced from arXiv: 2504.08644 by Davide Berghi, Philip J. B. Jackson.

Figure 1. Floor reflection path when source and receiver are at …
Figure 2. RIRs from the omnidirectional FOA channel of the …
Figure 4. Distance features with respective log mel spectrogram …
Original abstract

Sound event localization and detection (SELD) involves predicting active sound event classes over time while estimating their positions. The localization subtask in SELD is usually treated as a direction of arrival estimation problem, ignoring source distance. Only recently, SELD was extended to 3D by incorporating distance estimation, enabling the prediction of sound event positions in 3D space (3D SELD). However, existing methods lack input features specifically designed for distance estimation. We address this gap by introducing two novel reverberation-based feature formats: one using the direct-to-reverberant ratio (DRR) and another leveraging signal autocorrelation to capture early reflections. We extensively evaluate and benchmark these features on the STARSS23 dataset, combining them with established SELD features for sound event detection (SED) and direction-of-arrival estimation (DOAE), and testing across different network architectures. Our proposed features, applicable to both FOA and MIC formats, achieve state-of-the-art distance estimation, enhancing overall 3D SELD performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces two reverberation-based feature formats for 3D sound event localization and detection (SELD) with distance estimation: one using the direct-to-reverberant ratio (DRR) and another based on signal autocorrelation to capture early reflections. These are combined with standard SELD features for sound event detection and direction-of-arrival estimation, evaluated on the STARSS23 dataset across FOA and MIC formats and multiple network architectures, with the claim that they achieve state-of-the-art distance estimation and improve overall 3D SELD performance.

Significance. If the features extract generalizable distance information independent of specific room acoustics, the work addresses a clear gap in 3D SELD by providing dedicated input representations for distance. The evaluation across formats and architectures strengthens the contribution; however, the significance is limited by the absence of evidence that gains extend beyond STARSS23 conditions.

major comments (2)
  1. [Abstract] Abstract and evaluation sections: The central claim of state-of-the-art distance estimation and enhanced 3D SELD is asserted without any quantitative metrics, baseline comparisons, error bars, or ablation results, preventing assessment of whether the reverberation features deliver the reported gains.
  2. [Evaluation sections] Evaluation (STARSS23 experiments): No cross-room ablations, multi-room training/test splits, or held-out acoustic environments are reported; this directly undermines the claim that DRR and autocorrelation features supply distance information independent of the dataset's specific RT60 values, reflection patterns, and microphone placements.
minor comments (2)
  1. [Feature definition section] Provide explicit formulas or pseudocode for the autocorrelation feature computation, including any parameters for early-reflection windowing or normalization.
  2. [Dataset description] Add dataset statistics (e.g., number of rooms, RT60 range, source-distance distribution) to the experimental setup for context on the evaluation conditions.
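
As an illustration of the kind of computation the first minor comment asks about, the sketch below computes a normalized frame autocorrelation and shows that a single early reflection appears as a peak at the reflection's lag. This is a generic construction, not the paper's feature. The function name `frame_autocorrelation`, the frame length, the 0.6 reflection gain, and the 48-sample lag are all hypothetical.

```python
import numpy as np

def frame_autocorrelation(frame, max_lag):
    """Normalized autocorrelation of one frame for lags 0..max_lag."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Zero-pad to at least twice the frame length so the FFT-based
    # (circular) correlation equals the linear autocorrelation.
    n_fft = 1 << int(np.ceil(np.log2(2 * len(frame))))
    spectrum = np.fft.rfft(frame, n_fft)
    acf = np.fft.irfft(np.abs(spectrum) ** 2, n_fft)[: max_lag + 1]
    return acf / (acf[0] + 1e-12)

fs = 24000
rng = np.random.default_rng(0)
dry = rng.standard_normal(4096)            # stand-in for a broadband source
reflection_lag = 48                        # hypothetical 2 ms extra path delay
wet = dry.copy()
wet[reflection_lag:] += 0.6 * dry[:-reflection_lag]  # one attenuated reflection

acf = frame_autocorrelation(wet, max_lag=120)
peak_lag = int(np.argmax(acf[8:])) + 8     # skip the lags right next to zero
print(f"strongest non-zero lag: {peak_lag} samples ({1e3 * peak_lag / fs:.1f} ms)")
```

The lag of such a peak depends on the extra path length of the reflection, which in turn depends on the source-receiver geometry. Windowing, normalization, and the lag range are the parameters the referee asks the authors to state explicitly.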

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We have addressed each major comment point by point below, providing clarifications and indicating revisions made to strengthen the paper.

  1. Referee: [Abstract] Abstract and evaluation sections: The central claim of state-of-the-art distance estimation and enhanced 3D SELD is asserted without any quantitative metrics, baseline comparisons, error bars, or ablation results, preventing assessment of whether the reverberation features deliver the reported gains.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the claims. The evaluation sections of the manuscript already include detailed tables reporting distance estimation errors, SELD metrics (e.g., F-score, DOA error, distance error), comparisons against baselines, ablation studies isolating the contribution of DRR and autocorrelation features, and error bars from multiple training runs. To improve accessibility, we have revised the abstract to incorporate key numerical results, such as the achieved SOTA distance estimation performance on STARSS23 and the relative improvements when combining the proposed features with standard SELD inputs for both FOA and MIC formats. revision: yes

  2. Referee: [Evaluation sections] Evaluation (STARSS23 experiments): No cross-room ablations, multi-room training/test splits, or held-out acoustic environments are reported; this directly undermines the claim that DRR and autocorrelation features supply distance information independent of the dataset's specific RT60 values, reflection patterns, and microphone placements.

    Authors: We acknowledge that explicit cross-room or held-out environment experiments would provide stronger evidence for the generalizability of the reverberation features. The STARSS23 dataset contains recordings from multiple rooms with varying acoustics, and all reported experiments used the standard mixed training/test splits across these rooms. We have added a dedicated discussion paragraph in the evaluation section addressing potential room-specific dependencies and included results from a supplementary multi-room split (training on a subset of rooms and evaluating on the remainder) that shows consistent gains from the proposed features. While this partially mitigates the concern, we agree that testing on additional external datasets would further validate independence from specific acoustic conditions. revision: partial
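
A room-disjoint split of the kind described in this response could be sketched as follows. The metadata layout (a list of dicts with `path` and `room` keys), the function name `room_disjoint_split`, and the 30% test fraction are assumptions for illustration. They do not reflect the STARSS23 file structure or the split actually used.

```python
import random

def room_disjoint_split(recordings, test_fraction=0.3, seed=0):
    """Split recordings so that no room appears in both train and test."""
    rooms = sorted({rec["room"] for rec in recordings})
    rng = random.Random(seed)
    rng.shuffle(rooms)
    n_test = max(1, round(test_fraction * len(rooms)))
    test_rooms = set(rooms[:n_test])
    train = [rec for rec in recordings if rec["room"] not in test_rooms]
    test = [rec for rec in recordings if rec["room"] in test_rooms]
    return train, test

# Hypothetical metadata: 20 clips spread over 5 rooms.
recordings = [
    {"path": f"clip_{i:03d}.wav", "room": f"room_{i % 5}"} for i in range(20)
]
train, test = room_disjoint_split(recordings)
print(len(train), "training clips,", len(test), "held-out clips")
```

Holding out whole rooms, rather than clips, is what lets such an experiment test whether the reverberation features carry distance information beyond the acoustics seen in training.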

Circularity Check

0 steps flagged

No circularity: features derived from standard acoustic quantities independent of target metric

The paper introduces reverberation-based features using direct-to-reverberant ratio (DRR) and signal autocorrelation, which are established acoustic measures not defined in terms of the SELD performance or distance estimation outputs. The abstract and reader's summary indicate no equations, fitted parameters, or self-citation chains that reduce the claimed SOTA gains to the inputs by construction. Evaluation on the external STARSS23 dataset further supports that the derivation chain remains self-contained without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract mentions no free parameters, axioms, or new entities; features rely on established acoustic measures (DRR, autocorrelation) without additional postulates.
