Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation

Davide Berghi; Philip J. B. Jackson

arxiv: 2504.08644 · v2 · submitted 2025-04-11 · 📡 eess.AS · cs.SD· eess.SP

Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation

Davide Berghi , Philip J. B. Jackson This is my paper

Pith reviewed 2026-05-22 20:28 UTC · model grok-4.3

classification 📡 eess.AS cs.SDeess.SP

keywords sound event localization and detection3D SELDdistance estimationreverberationdirect-to-reverberant ratioautocorrelationFOAmicrophone array

0 comments

The pith

Reverberation-based features using direct-to-reverberant ratio and autocorrelation enable state-of-the-art distance estimation in 3D sound event localization and detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two new feature formats derived from reverberation properties to support distance estimation as part of 3D sound event localization and detection. One feature uses the direct-to-reverberant ratio while the other uses autocorrelation to capture early reflections. These are combined with standard features for detecting sound classes and estimating directions of arrival. Evaluations on the STARSS23 dataset show improved distance prediction across different input formats and network architectures, leading to better overall 3D SELD results.

Core claim

Reverberation-based features supply distance information for 3D SELD that is not captured by existing input features. Specifically, features based on the direct-to-reverberant ratio and on signal autocorrelation for early reflections, when added to conventional SELD features, achieve state-of-the-art distance estimation on the STARSS23 dataset for both FOA and MIC formats and across multiple network architectures.

What carries the argument

Two reverberation-based feature formats: one computed from the direct-to-reverberant ratio (DRR) and another from signal autocorrelation to capture early reflections; these provide explicit distance cues for sound event localization.

If this is right

These features improve overall 3D SELD performance when combined with established features for sound event detection and direction-of-arrival estimation.
The approach works with both first-order ambisonics (FOA) and microphone array (MIC) input formats.
State-of-the-art distance estimation is achieved on the STARSS23 dataset.
Performance gains hold across different network architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit modeling of room reverberation may be more effective for distance estimation than relying on deep networks to learn such cues implicitly from raw signals.
These features could be adapted to other acoustic scene analysis tasks that require source distance, such as in robot audition or smart home systems.
Testing on datasets with more diverse room sizes and reverberation times would help confirm the robustness of the gains.

Load-bearing premise

The reverberation features provide distance information independent of what is already in standard SELD features, and the improvements observed on the STARSS23 dataset generalize to other recording setups and acoustic environments.

What would settle it

Running the proposed features on a new dataset recorded in rooms with different sizes and reverberation characteristics and finding no improvement in distance estimation accuracy compared to baselines without the new features.

Figures

Figures reproduced from arXiv: 2504.08644 by Davide Berghi, Philip J. B. Jackson.

**Figure 2.** Figure 2: RIRs from the omnidirectional FOA channel of the [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: Distance features with respective log mel spectrogram [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

read the original abstract

Sound event localization and detection (SELD) involves predicting active sound event classes over time while estimating their positions. The localization subtask in SELD is usually treated as a direction of arrival estimation problem, ignoring source distance. Only recently, SELD was extended to 3D by incorporating distance estimation, enabling the prediction of sound event positions in 3D space (3D SELD). However, existing methods lack input features specifically designed for distance estimation. We address this gap by introducing two novel reverberation-based feature formats: one using the direct-to-reverberant ratio (DRR) and another leveraging signal autocorrelation to capture early reflections. We extensively evaluate and benchmark these features on the STARSS23 dataset, combining them with established SELD features for sound event detection (SED) and direction-of-arrival estimation (DOAE), and testing across different network architectures. Our proposed features, applicable to both FOA and MIC formats, achieve state-of-the-art distance estimation, enhancing overall 3D SELD performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Reverberation features improve distance estimation in 3D SELD on STARSS23 but risk capturing dataset-specific acoustics instead of general distance cues.

read the letter

This paper introduces reverberation-based features to add distance estimation to sound event localization and detection. The authors propose two formats, one based on the direct-to-reverberant ratio and another on signal autocorrelation for early reflections. They show these work with both FOA and microphone array inputs and improve overall 3D SELD on the STARSS23 dataset. What the work does well is fill a clear gap. Previous SELD methods focused on class and direction but left out distance, which limits use in real 3D spaces like robotics. By designing features specifically for distance and testing them across network architectures, the paper gives a practical way to extend existing pipelines. The fact that they benchmark on standard formats makes it easy for others to adopt. The soft spots are around validation. The claims of state-of-the-art distance estimation are stated without specific numbers or error bars in the abstract, so the size of the gain is hard to assess from the summary alone. More importantly, the features rely on reverberation properties that could be tied to the rooms and source distances in STARSS23. Without ablations on different acoustic conditions or held-out environments, it's possible the improvements are partly from matching the training data's acoustics rather than learning robust distance cues. That matches the stress-test concern, and it looks like a real issue given the lack of cross-room testing mentioned. This kind of paper is for people already working on spatial audio tasks in signal processing or machine learning. A reader building systems for sound-aware robots or virtual environments would get direct value from trying these features. It deserves a serious referee because it addresses an identified limitation with new inputs and reports results on a public dataset. I recommend putting it through peer review. The core idea is sound enough to warrant feedback on the experiments and generalization.

Referee Report

2 major / 2 minor

Summary. The paper introduces two reverberation-based feature formats for 3D sound event localization and detection (SELD) with distance estimation: one using the direct-to-reverberant ratio (DRR) and another based on signal autocorrelation to capture early reflections. These are combined with standard SELD features for sound event detection and direction-of-arrival estimation, evaluated on the STARSS23 dataset across FOA and MIC formats and multiple network architectures, with the claim that they achieve state-of-the-art distance estimation and improve overall 3D SELD performance.

Significance. If the features extract generalizable distance information independent of specific room acoustics, the work addresses a clear gap in 3D SELD by providing dedicated input representations for distance. The evaluation across formats and architectures strengthens the contribution; however, the significance is limited by the absence of evidence that gains extend beyond STARSS23 conditions.

major comments (2)

[Abstract] Abstract and evaluation sections: The central claim of state-of-the-art distance estimation and enhanced 3D SELD is asserted without any quantitative metrics, baseline comparisons, error bars, or ablation results, preventing assessment of whether the reverberation features deliver the reported gains.
[Evaluation sections] Evaluation (STARSS23 experiments): No cross-room ablations, multi-room training/test splits, or held-out acoustic environments are reported; this directly undermines the claim that DRR and autocorrelation features supply distance information independent of the dataset's specific RT60 values, reflection patterns, and microphone placements.

minor comments (2)

[Feature definition section] Provide explicit formulas or pseudocode for the autocorrelation feature computation, including any parameters for early-reflection windowing or normalization.
[Dataset description] Add dataset statistics (e.g., number of rooms, RT60 range, source-distance distribution) to the experimental setup for context on the evaluation conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We have addressed each major comment point by point below, providing clarifications and indicating revisions made to strengthen the paper.

read point-by-point responses

Referee: [Abstract] Abstract and evaluation sections: The central claim of state-of-the-art distance estimation and enhanced 3D SELD is asserted without any quantitative metrics, baseline comparisons, error bars, or ablation results, preventing assessment of whether the reverberation features deliver the reported gains.

Authors: We agree that the abstract would benefit from explicit quantitative support for the claims. The evaluation sections of the manuscript already include detailed tables reporting distance estimation errors, SELD metrics (e.g., F-score, DOA error, distance error), comparisons against baselines, ablation studies isolating the contribution of DRR and autocorrelation features, and error bars from multiple training runs. To improve accessibility, we have revised the abstract to incorporate key numerical results, such as the achieved SOTA distance estimation performance on STARSS23 and the relative improvements when combining the proposed features with standard SELD inputs for both FOA and MIC formats. revision: yes
Referee: [Evaluation sections] Evaluation (STARSS23 experiments): No cross-room ablations, multi-room training/test splits, or held-out acoustic environments are reported; this directly undermines the claim that DRR and autocorrelation features supply distance information independent of the dataset's specific RT60 values, reflection patterns, and microphone placements.

Authors: We acknowledge that explicit cross-room or held-out environment experiments would provide stronger evidence for the generalizability of the reverberation features. The STARSS23 dataset contains recordings from multiple rooms with varying acoustics, and all reported experiments used the standard mixed training/test splits across these rooms. We have added a dedicated discussion paragraph in the evaluation section addressing potential room-specific dependencies and included results from a supplementary multi-room split (training on a subset of rooms and evaluating on the remainder) that shows consistent gains from the proposed features. While this partially mitigates the concern, we agree that testing on additional external datasets would further validate independence from specific acoustic conditions. revision: partial

Circularity Check

0 steps flagged

No circularity: features derived from standard acoustic quantities independent of target metric

full rationale

The paper introduces reverberation-based features using direct-to-reverberant ratio (DRR) and signal autocorrelation, which are established acoustic measures not defined in terms of the SELD performance or distance estimation outputs. The abstract and reader's summary indicate no equations, fitted parameters, or self-citation chains that reduce the claimed SOTA gains to the inputs by construction. Evaluation on the external STARSS23 dataset further supports that the derivation chain remains self-contained without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract mentions no free parameters, axioms, or new entities; features rely on established acoustic measures (DRR, autocorrelation) without additional postulates.

pith-pipeline@v0.9.0 · 5714 in / 1001 out tokens · 74249 ms · 2026-05-22T20:28:04.115930+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1]

Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,

S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 34–48, 2019

work page 2019
[2]

A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,

A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,” in Detection and Classification of Acoustic Scenes and Events Workshop, 2020

work page 2020
[3]

A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,

A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,” in Detection and Classification of Acoustic Scenes and Events Workshop , 2021

work page 2021
[4]

Event-independent network for polyphonic sound event localization and detection,

Y . Cao, T. Iqbal, Q. Kong, Y . Zhong, W. Wang, and M. D. Plumbley, “Event-independent network for polyphonic sound event localization and detection,” inDetection and Classification of Acoustic Scenes and Events Workshop, 2020

work page 2020
[5]

An improved event-independent network for polyphonic sound event local- ization and detection,

Y . Cao, T. Iqbal, Q. Kong, F. An, W. Wang, and M. D. Plumbley, “An improved event-independent network for polyphonic sound event local- ization and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2021, pp. 885–889

work page 2021
[6]

Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,

K. Shimada, Y . Koyama, S. Takahashi, N. Takahashi, E. Tsunoo, and Y . Mitsufuji, “Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2022, pp. 316–320

work page 2022
[7]

STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,

K. Shimada, A. Politis, P. Sudarsanam, D. A. Krause, K. Uchida, S. Adavanne, A. Hakala, Y . Koyama, N. Takahashi, S. Takahashi, T. Virtanen, and Y . Mitsufuji, “STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” in International Conference on Neural Information Processing Systems, 2023

work page 2023
[8]

Fusion of audio and visual embeddings for sound event localization and detection,

D. Berghi, P. Wu, J. Zhao, W. Wang, and P. J. B. Jackson, “Fusion of audio and visual embeddings for sound event localization and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

work page 2024
[9]

Exploring audio-visual infor- mation fusion for sound event localization and detection in low-resource realistic scenarios,

Y . Jiang, Q. Wang, J. Du, M. Hu, P. Hu, Z. Liu, S. Cheng, Z. Nian, Y . Dong, M. Cai, X. Fang, and C.-H. Lee, “Exploring audio-visual infor- mation fusion for sound event localization and detection in low-resource realistic scenarios,” in IEEE International Conference on Multimedia and Expo, 2024, pp. 1–6

work page 2024
[10]

Enhanced sound event localization and detection in real 360-degree audio-visual sound- scapes,

A. S. Roman, B. Balamurugan, and R. Pothuganti, “Enhanced sound event localization and detection in real 360-degree audio-visual sound- scapes,” ArXiv, vol. abs/2401.17129, 2024

work page arXiv 2024
[11]

Baseline models and evalu- ation of sound event localization and detection with distance estimation in DCASE 2024 Challenge,

D. Diaz-Guerra, A. Politis, P. Ariyakulam Sudarsanam, K. Shimada, D. Krause, K. Uchida, Y . Koyama, N. Takahashi, S. Takahashi, T. Shibuya, Y . Mitsufuji, and T. Virtanen, “Baseline models and evalu- ation of sound event localization and detection with distance estimation in DCASE 2024 Challenge,” in Detection and Classification of Acoustic Scenes and Eve...

work page 2024
[12]

Sound event detection and localization with distance estimation,

D. A. Krause, A. Politis, and A. Mesaros, “Sound event detection and localization with distance estimation,” in European Signal Processing Conference, 2024, pp. 286–290

work page 2024
[13]

MV ANet: Multi-stage video attention network for sound event localization and detection with source distance estimation,

H. Hong, Q. Wang, J. Du, R. Wei, M. Cai, and X. Fang, “MV ANet: Multi-stage video attention network for sound event localization and detection with source distance estimation,” ArXiv, vol. abs/2411.14153, 2024

work page arXiv 2024
[14]

Audio inputs for active speaker detection and localization via microphone array,

D. Berghi and P. J. B. Jackson, “Audio inputs for active speaker detection and localization via microphone array,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics , 2023

work page 2023
[15]

The generalized correlation method for estimation of time delay,

C. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976

work page 1976
[16]

SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,

T. N. Tho Nguyen, D. L. Jones, K. N. Watcharasupat, H. Phan, and W.-S. Gan, “SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2022, pp. 716–720

work page 2022
[17]

Intensity and reverberation as factors in the auditory perception of egocentric distance,

D. H. Mershon and L. E. King, “Intensity and reverberation as factors in the auditory perception of egocentric distance,” Perception & Psy- chophysics, vol. 18, pp. 409–415, 1975

work page 1975
[18]

An investigation of the effects of direct and reverberant signal interactions on auditory distance perception,

C. Sheeline, “An investigation of the effects of direct and reverberant signal interactions on auditory distance perception,” Ph.D., Stanford University, 1982. [Online]. Available: https://ccrma.stanford.edu/files/ papers/stanm13.pdf

work page 1982
[19]

The importance of the direct to reverberant ratio in the perception of distance, localization, clarity, and envelopment, part one

D. Griesinger, “The importance of the direct to reverberant ratio in the perception of distance, localization, clarity, and envelopment, part one.” The Journal of the Acoustical Society of America , vol. 125, pp. 2483– 2483, 2009

work page 2009
[20]

Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources,

Y .-C. Lu and M. Cooke, “Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1793–1805, 2010

work page 2010
[21]

Speaker distance detection using a single microphone,

E. Georganti, T. May, S. van de Par, A. Harma, and J. Mourjopoulos, “Speaker distance detection using a single microphone,” IEEE Transac- tions on Audio, Speech, and Language Processing , vol. 19, no. 7, pp. 1949–1961, 2011

work page 1949
[22]

Source Distance Perception with Rever- berant Spatial Audio Object Reproduction of Real Rooms,

S. Chitreddy and P. Jackson, “Source Distance Perception with Rever- berant Spatial Audio Object Reproduction of Real Rooms,” in Forum Acusticum, 2020, pp. 2079–2086

work page 2020
[23]

Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening,

T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening,” IEEE Transactions on Audio, Speech, and Language Processing , vol. 20, no. 10, pp. 2707–2720, 2012

work page 2012
[24]

NARA- WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing,

L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, “NARA- WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing,” in Speech Communication; 13th ITG-Symposium , 2018, pp. 1–5

work page 2018
[25]

Leveraging reverberation and visual depth cues for sound event localization and detection with distance estimation,

D. Berghi and P. J. B. Jackson, “Leveraging reverberation and visual depth cues for sound event localization and detection with distance estimation,” in Techical Report of DCASE Challenge , 2024

work page 2024
[26]

Auditory distance perception in rooms,

A. W. Bronkhorst and T. Houtgast, “Auditory distance perception in rooms,” Nature, vol. 397, pp. 517–520, 1999

work page 1999
[27]

Perception of reverberation in small rooms: A literature study,

N. Kaplanis, S. Bech, S. J. Holdt, and T. van Waterschoot, “Perception of reverberation in small rooms: A literature study,” in Audio Engineering Society Conference, 2014

work page 2014
[28]

Authoring inter- compatible flexible audio for mass personalization,

C. Cieciura, E. Bargiacchi, and P. J. B. Jackson, “Authoring inter- compatible flexible audio for mass personalization,” in The 157th Audio Engineering Society Convention , 2024

work page 2024
[29]

Presenting the S3A object-based audio drama dataset,

J. Woodcock, C. Pike, F. Melchior, P. Coleman, A. Franck, and A. Hilton, “Presenting the S3A object-based audio drama dataset,” in The 140th Audio Engineering Society Convention , 2016

work page 2016
[30]

SurrRoom 1.0 Dataset: Spatial room capture with controlled acoustic and optical measure- ments,

C. Cieciura, M. V olino, and P. J. B. Jackson, “SurrRoom 1.0 Dataset: Spatial room capture with controlled acoustic and optical measure- ments,” in The 154th Audio Engineering Society Convention , 2023

work page 2023
[31]

A four-stage data augmentation approach to resnet-conformer based acoustic model- ing for sound event localization and detection,

Q. Wang, J. Du, H.-X. Wu, J. Pan, F. Ma, and C.-H. Lee, “A four-stage data augmentation approach to resnet-conformer based acoustic model- ing for sound event localization and detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 31, pp. 1251–1264, 2023

work page 2023
[32]

Resnet-conformer network using multi-scale channel attention for sound event localization and detection in real scenes,

L. Xue, H. Liu, Y . Zhou, and L. Gan, “Resnet-conformer network using multi-scale channel attention for sound event localization and detection in real scenes,” inInternational Conference on Wireless Communications and Signal Processing , 2023

work page 2023
[33]

Conformer: Convolution- augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. Pang, “Conformer: Convolution- augmented transformer for speech recognition,” in Interspeech, 2020, pp. 5036–5040

work page 2020
[34]

STARSS22: A dataset of spatial recordings of real scenes with spa- tiotemporal annotations of sound events,

A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y . Koyama, N. Takahashi, S. Takahashi, Y . Mitsufuji, and T. Virtanen, “STARSS22: A dataset of spatial recordings of real scenes with spa- tiotemporal annotations of sound events,” Detection and Classification of Acoustic Scenes and Events Workshop , 2022

work page 2022
[35]

Spatial Scaper: A library to simulate and augment soundscapes for sound event localization and detection in realistic rooms,

I. R. Roman, C. Ick, S. Ding, A. S. Roman, B. McFee, and J. P. Bello, “Spatial Scaper: A library to simulate and augment soundscapes for sound event localization and detection in realistic rooms,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2024

work page 2024
[36]

The jackknife estimate of variance,

B. Efron and C. Stein, “The jackknife estimate of variance,” The Annals of Statistics, vol. 9, no. 3, pp. 586–596, 1981

work page 1981
[37]

Auditory distance percep- tion in humans: A summary of past and present research,

P. Zahorik, D. Brungart, and A. Bronkhorst, “Auditory distance percep- tion in humans: A summary of past and present research,” Acta Acustica United With Acustica , vol. 91, pp. 409–420, 2005

work page 2005
[38]

Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss,

A. Kolarik, B. Moore, P. Zahorik, S. Cirstea, and S. Pardhan, “Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss,” Atten Percept Psychophys , vol. 78, p. 373–395, 2016

work page 2016

[1] [1]

Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,

S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 34–48, 2019

work page 2019

[2] [2]

A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,

A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,” in Detection and Classification of Acoustic Scenes and Events Workshop, 2020

work page 2020

[3] [3]

A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,

A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,” in Detection and Classification of Acoustic Scenes and Events Workshop , 2021

work page 2021

[4] [4]

Event-independent network for polyphonic sound event localization and detection,

Y . Cao, T. Iqbal, Q. Kong, Y . Zhong, W. Wang, and M. D. Plumbley, “Event-independent network for polyphonic sound event localization and detection,” inDetection and Classification of Acoustic Scenes and Events Workshop, 2020

work page 2020

[5] [5]

An improved event-independent network for polyphonic sound event local- ization and detection,

Y . Cao, T. Iqbal, Q. Kong, F. An, W. Wang, and M. D. Plumbley, “An improved event-independent network for polyphonic sound event local- ization and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2021, pp. 885–889

work page 2021

[6] [6]

Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,

K. Shimada, Y . Koyama, S. Takahashi, N. Takahashi, E. Tsunoo, and Y . Mitsufuji, “Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2022, pp. 316–320

work page 2022

[7] [7]

STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,

K. Shimada, A. Politis, P. Sudarsanam, D. A. Krause, K. Uchida, S. Adavanne, A. Hakala, Y . Koyama, N. Takahashi, S. Takahashi, T. Virtanen, and Y . Mitsufuji, “STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” in International Conference on Neural Information Processing Systems, 2023

work page 2023

[8] [8]

Fusion of audio and visual embeddings for sound event localization and detection,

D. Berghi, P. Wu, J. Zhao, W. Wang, and P. J. B. Jackson, “Fusion of audio and visual embeddings for sound event localization and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

work page 2024

[9] [9]

Exploring audio-visual infor- mation fusion for sound event localization and detection in low-resource realistic scenarios,

Y . Jiang, Q. Wang, J. Du, M. Hu, P. Hu, Z. Liu, S. Cheng, Z. Nian, Y . Dong, M. Cai, X. Fang, and C.-H. Lee, “Exploring audio-visual infor- mation fusion for sound event localization and detection in low-resource realistic scenarios,” in IEEE International Conference on Multimedia and Expo, 2024, pp. 1–6

work page 2024

[10] [10]

Enhanced sound event localization and detection in real 360-degree audio-visual sound- scapes,

A. S. Roman, B. Balamurugan, and R. Pothuganti, “Enhanced sound event localization and detection in real 360-degree audio-visual sound- scapes,” ArXiv, vol. abs/2401.17129, 2024

work page arXiv 2024

[11] [11]

Baseline models and evalu- ation of sound event localization and detection with distance estimation in DCASE 2024 Challenge,

D. Diaz-Guerra, A. Politis, P. Ariyakulam Sudarsanam, K. Shimada, D. Krause, K. Uchida, Y . Koyama, N. Takahashi, S. Takahashi, T. Shibuya, Y . Mitsufuji, and T. Virtanen, “Baseline models and evalu- ation of sound event localization and detection with distance estimation in DCASE 2024 Challenge,” in Detection and Classification of Acoustic Scenes and Eve...

work page 2024

[12] [12]

Sound event detection and localization with distance estimation,

D. A. Krause, A. Politis, and A. Mesaros, “Sound event detection and localization with distance estimation,” in European Signal Processing Conference, 2024, pp. 286–290

work page 2024

[13] [13]

MV ANet: Multi-stage video attention network for sound event localization and detection with source distance estimation,

H. Hong, Q. Wang, J. Du, R. Wei, M. Cai, and X. Fang, “MV ANet: Multi-stage video attention network for sound event localization and detection with source distance estimation,” ArXiv, vol. abs/2411.14153, 2024

work page arXiv 2024

[14] [14]

Audio inputs for active speaker detection and localization via microphone array,

D. Berghi and P. J. B. Jackson, “Audio inputs for active speaker detection and localization via microphone array,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics , 2023

work page 2023

[15] [15]

The generalized correlation method for estimation of time delay,

C. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976

work page 1976

[16] [16]

SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,

T. N. Tho Nguyen, D. L. Jones, K. N. Watcharasupat, H. Phan, and W.-S. Gan, “SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2022, pp. 716–720

work page 2022

[17] [17]

Intensity and reverberation as factors in the auditory perception of egocentric distance,

D. H. Mershon and L. E. King, “Intensity and reverberation as factors in the auditory perception of egocentric distance,” Perception & Psy- chophysics, vol. 18, pp. 409–415, 1975

work page 1975

[18] [18]

An investigation of the effects of direct and reverberant signal interactions on auditory distance perception,

C. Sheeline, “An investigation of the effects of direct and reverberant signal interactions on auditory distance perception,” Ph.D., Stanford University, 1982. [Online]. Available: https://ccrma.stanford.edu/files/ papers/stanm13.pdf

work page 1982

[19] [19]

The importance of the direct to reverberant ratio in the perception of distance, localization, clarity, and envelopment, part one

D. Griesinger, “The importance of the direct to reverberant ratio in the perception of distance, localization, clarity, and envelopment, part one.” The Journal of the Acoustical Society of America , vol. 125, pp. 2483– 2483, 2009

work page 2009

[20] [20]

Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources,

Y .-C. Lu and M. Cooke, “Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1793–1805, 2010

work page 2010

[21] [21]

Speaker distance detection using a single microphone,

E. Georganti, T. May, S. van de Par, A. Harma, and J. Mourjopoulos, “Speaker distance detection using a single microphone,” IEEE Transac- tions on Audio, Speech, and Language Processing , vol. 19, no. 7, pp. 1949–1961, 2011

work page 1949

[22] [22]

Source Distance Perception with Rever- berant Spatial Audio Object Reproduction of Real Rooms,

S. Chitreddy and P. Jackson, “Source Distance Perception with Rever- berant Spatial Audio Object Reproduction of Real Rooms,” in Forum Acusticum, 2020, pp. 2079–2086

work page 2020

[23] [23]

Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening,

T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening,” IEEE Transactions on Audio, Speech, and Language Processing , vol. 20, no. 10, pp. 2707–2720, 2012

work page 2012

[24] [24]

NARA- WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing,

L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, “NARA- WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing,” in Speech Communication; 13th ITG-Symposium , 2018, pp. 1–5

work page 2018

[25] [25]

Leveraging reverberation and visual depth cues for sound event localization and detection with distance estimation,

D. Berghi and P. J. B. Jackson, “Leveraging reverberation and visual depth cues for sound event localization and detection with distance estimation,” in Techical Report of DCASE Challenge , 2024

work page 2024

[26] [26]

Auditory distance perception in rooms,

A. W. Bronkhorst and T. Houtgast, “Auditory distance perception in rooms,” Nature, vol. 397, pp. 517–520, 1999

work page 1999

[27] [27]

Perception of reverberation in small rooms: A literature study,

N. Kaplanis, S. Bech, S. J. Holdt, and T. van Waterschoot, “Perception of reverberation in small rooms: A literature study,” in Audio Engineering Society Conference, 2014

work page 2014

[28] [28]

Authoring inter- compatible flexible audio for mass personalization,

C. Cieciura, E. Bargiacchi, and P. J. B. Jackson, “Authoring inter- compatible flexible audio for mass personalization,” in The 157th Audio Engineering Society Convention , 2024

work page 2024

[29] [29]

Presenting the S3A object-based audio drama dataset,

J. Woodcock, C. Pike, F. Melchior, P. Coleman, A. Franck, and A. Hilton, “Presenting the S3A object-based audio drama dataset,” in The 140th Audio Engineering Society Convention , 2016

work page 2016

[30] [30]

SurrRoom 1.0 Dataset: Spatial room capture with controlled acoustic and optical measure- ments,

C. Cieciura, M. V olino, and P. J. B. Jackson, “SurrRoom 1.0 Dataset: Spatial room capture with controlled acoustic and optical measure- ments,” in The 154th Audio Engineering Society Convention , 2023

work page 2023

[31] [31]

A four-stage data augmentation approach to resnet-conformer based acoustic model- ing for sound event localization and detection,

Q. Wang, J. Du, H.-X. Wu, J. Pan, F. Ma, and C.-H. Lee, “A four-stage data augmentation approach to resnet-conformer based acoustic model- ing for sound event localization and detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 31, pp. 1251–1264, 2023

work page 2023

[32] [32]

Resnet-conformer network using multi-scale channel attention for sound event localization and detection in real scenes,

L. Xue, H. Liu, Y . Zhou, and L. Gan, “Resnet-conformer network using multi-scale channel attention for sound event localization and detection in real scenes,” inInternational Conference on Wireless Communications and Signal Processing , 2023

work page 2023

[33] [33]

Conformer: Convolution- augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. Pang, “Conformer: Convolution- augmented transformer for speech recognition,” in Interspeech, 2020, pp. 5036–5040

work page 2020

[34] [34]

STARSS22: A dataset of spatial recordings of real scenes with spa- tiotemporal annotations of sound events,

A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y . Koyama, N. Takahashi, S. Takahashi, Y . Mitsufuji, and T. Virtanen, “STARSS22: A dataset of spatial recordings of real scenes with spa- tiotemporal annotations of sound events,” Detection and Classification of Acoustic Scenes and Events Workshop , 2022

work page 2022

[35] [35]

Spatial Scaper: A library to simulate and augment soundscapes for sound event localization and detection in realistic rooms,

I. R. Roman, C. Ick, S. Ding, A. S. Roman, B. McFee, and J. P. Bello, “Spatial Scaper: A library to simulate and augment soundscapes for sound event localization and detection in realistic rooms,” in IEEE International Conference on Acoustics, Speech and Signal Processing , 2024

work page 2024

[36] [36]

The jackknife estimate of variance,

B. Efron and C. Stein, “The jackknife estimate of variance,” The Annals of Statistics, vol. 9, no. 3, pp. 586–596, 1981

work page 1981

[37] [37]

Auditory distance percep- tion in humans: A summary of past and present research,

P. Zahorik, D. Brungart, and A. Bronkhorst, “Auditory distance percep- tion in humans: A summary of past and present research,” Acta Acustica United With Acustica , vol. 91, pp. 409–420, 2005

work page 2005

[38] [38]

Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss,

A. Kolarik, B. Moore, P. Zahorik, S. Cirstea, and S. Pardhan, “Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss,” Atten Percept Psychophys , vol. 78, p. 373–395, 2016

work page 2016