arxiv: 2606.27751 · v1 · pith:BM4N7MPSnew · submitted 2026-06-26 · 💻 cs.SD · cs.AI

From General-Purpose Audio Tagging to Spatially Grounded Sound Event Localization and Detection

Pith reviewed 2026-06-29 03:18 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords general-purpose audio tagging · sound event localization and detection · first-order ambisonics · neural architecture search · pretrained models · direction of arrival estimation · spatial audio processing

The pith

Pretrained general-purpose audio tagging models can support sound event localization and detection when coupled with first-order ambisonics descriptors and multi-stage architecture search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how semantic priors learned by general-purpose audio tagging models can be repurposed for the joint task of detecting and locating sound events in space. It introduces the AT2SELD framework that attaches a pretrained tagging backbone to compact first-order ambisonics spatial processing, track-wise detection, direction-of-arrival estimation, and calibration steps. Through three stages of neural architecture search it identifies which input descriptors and fusion points allow semantic knowledge to transfer into spatial reasoning under limited data and compute. Diagnostic tests across multiple datasets show that focal loss, activity-conditioned supervision, and threshold selection each improve different parts of the pipeline without replacing the learned spatial representations. The work concludes that such priors remain useful when the architecture explicitly separates and then recombines semantic and spatial pathways.

Core claim

Spectral first-order ambisonics descriptors based on magnitude, phase, and intensity vectors supply the most reliable interface for transferring semantic priors from a general-purpose audio tagging backbone into sound event localization and detection; early residual spatial encoding is the main capacity bottleneck while late cross-stitch fusion and recurrent smoothing act as refinement stages.
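
As a rough illustration of the descriptor family named above, the sketch below derives magnitude, phase, and intensity-vector features from a four-channel FOA short-time Fourier transform. The channel order (W, X, Y, Z), the per-bin unit-length normalization, and the function name foa_descriptors are assumptions made for this example; the paper's exact feature pipeline is not given on this page.

```python
import numpy as np

def foa_descriptors(stft_wxyz, eps=1e-8):
    """Compute magnitude, phase, and intensity-vector features.

    stft_wxyz: complex array of shape (4, frames, bins), assumed to be
    ordered W, X, Y, Z (hypothetical convention for this sketch).
    """
    w, xyz = stft_wxyz[0], stft_wxyz[1:]
    magnitude = np.abs(stft_wxyz)                  # (4, frames, bins)
    phase = np.angle(stft_wxyz)                    # (4, frames, bins)
    # Active intensity vector: real part of conj(W) * [X, Y, Z].
    intensity = np.real(np.conj(w)[None] * xyz)    # (3, frames, bins)
    # Assumed normalization: scale each bin to unit length so that
    # only the direction information remains.
    norm = np.linalg.norm(intensity, axis=0, keepdims=True)
    intensity = intensity / (norm + eps)
    return magnitude, phase, intensity

# Random complex data stands in for a real FOA recording.
rng = np.random.default_rng(0)
stft = rng.standard_normal((4, 10, 33)) + 1j * rng.standard_normal((4, 10, 33))
mag, ph, iv = foa_descriptors(stft)
print(mag.shape, ph.shape, iv.shape)
```

The three arrays could then be stacked along the channel axis before entering a spatial encoder, although how the paper combines them is not stated here.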

What carries the argument

AT2SELD framework that couples a pretrained audio-tagging backbone with first-order ambisonics spatial processing, track-wise SED, Cartesian DOA estimation, permutation-aware supervision, and calibration, discovered via multi-stage NAS.
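
To make "Cartesian DOA estimation" and "permutation-aware supervision" concrete, here is a minimal sketch that converts azimuth and elevation to a unit vector and scores track-wise predictions with the lowest error over all track orderings. The axis convention, the brute-force search over permutations, and the helper names sph_to_cart and permutation_invariant_mse are illustrative assumptions, not the paper's implementation.

```python
import itertools
import numpy as np

def sph_to_cart(azimuth_deg, elevation_deg):
    """Unit vector for a direction (assumed convention: x front, y left, z up)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def permutation_invariant_mse(pred, target):
    """Lowest MSE over all assignments of predicted tracks to target tracks.

    pred, target: arrays of shape (n_tracks, 3).
    Returns (best_loss, best_permutation).
    """
    n_tracks = pred.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_tracks)):
        loss = np.mean((pred[list(perm)] - target) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two sources; the prediction has the right directions in swapped slots.
target = np.stack([sph_to_cart(35, 20), sph_to_cart(-90, 0)])
pred = target[::-1].copy()
loss, perm = permutation_invariant_mse(pred, target)
print(loss, perm)
```

Exhaustive search is only practical for a handful of track slots; with more tracks an assignment solver would be the usual substitute.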

If this is right

  • Spectral FOA magnitude-phase-intensity descriptors enable more reliable semantic-to-spatial transfer than alternative input representations.
  • Early residual spatial encoding is the component most sensitive to model capacity; late track-wise abstraction and recurrent smoothing mainly refine outputs.
  • Late cross-stitch coupling between semantic and spatial streams improves interaction more effectively than early fusion at lower computational cost.
  • Focal loss improves the activity operating point, and activity-conditioned (active-only) DOA supervision mitigates inactive-target dominance.
  • Validation-selected thresholds recover calibration performance on new datasets while preserving the spatial learning already achieved.
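
The focal-loss point can be illustrated with a small binary example. The sketch uses the standard focal-loss form with gamma = 2 and alpha = 0.25, which are common defaults and an assumption here; the values used in the paper are not reported on this page.

```python
import numpy as np

def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-element binary focal loss.

    prob: predicted activity probabilities in [0, 1].
    target: binary labels (1 = active, 0 = inactive).
    gamma, alpha: assumed defaults, not taken from the paper.
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)        # probability of the true label
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)  # class weighting
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

prob = np.array([0.9, 0.1, 0.6, 0.4])
target = np.array([1, 0, 1, 0])
focal = binary_focal_loss(prob, target)
bce = -np.log(np.where(target == 1, prob, 1.0 - prob))
print(focal / bce)  # the ratio is smaller for confident, correct predictions
```

Because easy examples are scaled down more strongly than hard ones, the many confidently inactive frames contribute less to the total loss, which is the mechanism usually credited for better behavior under class imbalance.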

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same descriptor interface may allow pretrained tagging models to initialize other spatial audio tasks such as source separation or acoustic mapping.
  • The identified split between early spatial encoding and late semantic refinement could be tested for transfer to non-ambisonics microphone arrays.
  • Integrated calibration steps suggest that deployment pipelines for SELD can be made more robust by treating threshold selection as a final validation stage rather than part of spatial training.
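
Treating threshold selection as a separate validation step can be sketched as a grid search that keeps the trained model fixed and only picks the detection threshold. The F1 criterion, the grid, and the names f1_score and select_threshold are assumptions for this example; the page does not say which metric or grid the paper uses.

```python
import numpy as np

def f1_score(pred, target):
    """F1 for boolean prediction and target arrays."""
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def select_threshold(val_scores, val_targets, grid=None):
    """Return the grid threshold with the highest validation F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # assumed search grid
    f1s = [f1_score(val_scores >= t, val_targets) for t in grid]
    best = int(np.argmax(f1s))
    return float(grid[best]), float(f1s[best])

# Toy validation scores: a default threshold of 0.5 would miss two events.
val_scores = np.array([0.12, 0.22, 0.37, 0.41, 0.72, 0.81])
val_targets = np.array([False, False, True, True, True, True])
threshold, best_f1 = select_threshold(val_scores, val_targets)
print(threshold, best_f1)
```

The selected threshold is applied unchanged at test time, so the learned spatial representation is never touched by this step.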

Load-bearing premise

Spectral first-order ambisonics descriptors based on magnitude, phase, and intensity vectors form a reliable and general interface for semantic-to-spatial knowledge transfer across datasets.

What would settle it

If the architecture selected by the three-stage NAS fails to outperform a strong non-pretrained baseline on a held-out dataset with new room acoustics or source distributions, the claim that GP-AT priors transfer usefully would be falsified.

Figures

Figures reproduced from arXiv: 2606.27751 by Claudia Rinaldi, Fabio Graziosi, Stefano Damiano, Stefano Giacomelli, Toon van Waterschoot.

Figure 1: Class-wise two-branch SELD output format, with separate SED activity estimates …
Figure 2: ACCDOA representation, where event activity is encoded by the norm of a …
Figure 3: Multi-ACCDOA and ADPIT representation for same-class spatial overlap and …
Figure 4: Comparison between class-wise SELD output formatting and track-wise event …
Figure 5: SELDnet model structure, with convolutional …
Figure 6: U-style recurrent neural network for SELD, combining multi-scale convolutional processing, skip connections, recurrent temporal modeling, and parallel SED/DOA prediction branches, adapted from L. Pi et al., “U Recurrent Neural Network for Polyphonic Sound Event Detection and Localization” [18]. An early refinement of this CRNN paradigm is the U-style recurrent neural network, which preserves the same jo…
Figure 7: Schematic representation of GCC-PHAT-based time-delay estimation, based on …
Figure 8: Example of SELDnet tracking behavior on a moving-source scene, with reference …
Figure 9: PI-RNN attention module for soft association …
Figure 10: EIN-V1 model structure, with event-independent tracks and an auxiliary EAD …
Figure 11: EIN-V2 model structure, with track-wise SELD prediction, MHSA-based track …
Figure 12: Residual block structure with identity shortcut, adapted from A. Zhang et al., “Dive into Deep Learning” [25]. Given an input feature map X, a residual block computes Y = F(X; θ) + S(X) (36), where Y is the block output, F(·; θ) denotes the learned residual transformation parameterized by θ, and S(·) is the shortcut path, implemented either as an identity mapping or as a projection when the number of ch…
Figure 13: Conformer module structure, combining feed-forward, self-attention, convolu…
Figure 14: MSCA module for joint local and global channel recalibration, adapted from L. Xue et al., “Resnet-Conformer Network Using Multi-Scale Channel Attention for Sound Event Localization and Detection in Real Scenes” [22]. The MSCA block ( …
Figure 15: SoundDet model structure, with raw-waveform feature extraction, framewise …
Figure 16: SoundDet dense event proposal module, adapted from Y. He et al., “SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform” [34]. The learned waveform representation is processed by a 1D convolutional encoder–decoder backbone with skip connections. The encoder progressively reduces temporal resolution while increasing channel dimensionality, whereas the decoder partially res…
Figure 17: SoundDoA model structure, with learnable Gabor-domain front-end, semantic– …
Figure 18: SoundDoA learnable front-end and enhancement module, adapted from Y. He and A. Markham, “SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms” [35]. The downstream SoundDoA architecture differs from SoundDet in that it does not score dense temporal proposals. Instead, it uses two parallel sub-networks with layer-wise communication, one oriented toward semantic label …
Figure 19: DOANet model structure for differentiable …
Figure 20: SRP-PHAT maps under favorable and challenging acoustic conditions, adapted …
Figure 21: Cross3D architecture for causal SRP-PHAT-map tracking, adapted from D. Díaz …
Figure 22: Neural-SRP architecture with shared pairwise encoding, microphone-coordinate …
Figure 23: NAS Stage 1: shallow grid search over spatial front-end families, early spatial …
Figure 24: NAS Stage 2: controlled depth allocation over the early spatial stage, late …
Figure 25: NAS Stage 3: regularization and semantic–spatial interaction search through …
Figure 26: NAS Stage 4: diagnostic characterization over data balancing …
Figure 27: TAU2019 temporal polyphony examples: split0 is used as test set, split1 as validation set, and split2-3 as training set. A first layer of unification concerns the target interface. All datasets are converted to a track-wise Cartesian target tensor Y ∈ R^(T × N_tracks × C × 3), where T is the target temporal resolution, N_tracks is the maximum number of track slots, C is the class vocabulary of the corresponding cor…
Figure 28: STARSS23 and TAU-NIGENS2021 DOA marginal distributions. Event density …
Figure 29: TAU2019 DOA marginal distributions by class. Event density is represented by …
Figure 30: AT2SELD data pipeline, from configuration assembly and split construction to …
Figure 31: Effect of the 16patterns augmentation on a fixed Cartesian position with azimuth 35° and elevation 20°. The blue point denotes the original position, whereas the orange point denotes the transformed position.
Figure 32: Effect of the 16patterns augmentation on a synthetic spatial trajectory. The blue curve denotes the original trajectory, whereas the orange curve denotes the transformed trajectory.
Figure 33: Technical profiling of the candidate spatial modules.
Figure 34: Stage 1 AT2SELD results on STARSS23: validation SELD score trends for the …
Figure 35: Stage 1 AT2SELD shallow-grid results on STARSS23: training-loss trends.
Figure 36: Stage 2 AT2SELD validation SELD trends for the …
Figure 37: Stage 2 AT2SELD validation SELD score trends on STARSS23 …
Figure 39: Regularized ResNetBlock used in Stage 3. The skip path represents the residual shortcut added before the final activation and spatial dropout. The dropout probability is set to 0.4, so that regularization acts on entire time–frequency feature maps rather than on independent scalar activations. In the late abstraction stage, dropout with probability 0.3 is applied inside the TrackTransformer encoder layer,…
Figure 40: Stage 3 metric deltas with respect to the regularized no-stitch baseline. Values …
Figure 41: Stage 3 validation loss-component curves.
Figure 42: Stage 3 validation SELD score curves. Red circles denote best-score checkpoints. This result supports a position-dependent interpretation of semantic–spatial interaction. The late bridge acts after the spatial stream has already performed time–frequency organization and track-wise abstraction, so semantic evidence can be injected as high-level conditioning. The early bridge instead acts while the spatial…
Figure 43: Stage 3 auxiliary semantic-presence per-class F1 scores.
Figure 44: STARSS23 split distributions used in Stages 1–3.
Figure 45: Class distribution of BalancedSTARSS23Dataset after external projection, greedy clip selection, and mixed synthesis.
Figure 46: DOA marginal distributions of BalancedSTARSS23Dataset after external projection, greedy clip selection, and mixed synthesis. … spatial-audio conditions. The balancing stage therefore establishes a clear separation between coverage and learnability: class exposure is substantially improved, but the following sections are required to determine whether the joint SELD objective can exploit that additional sup…
Figure 47: Coverage-aware SED and SELD F1 scores for the selected AT2SELD family.
Figure 48: Coverage-aware oracle-activity DOA accuracy at …

Original abstract

This report investigates the extension of pretrained General-Purpose Audio Tagging (GP-AT) models toward spatially grounded Sound Event Localization and Detection (SELD). The proposed AT2SELD framework couples a pretrained AT backbone with compact First-Order Ambisonics (FOA) spatial processing, track-wise SED and Cartesian DOA estimation, permutation-aware supervision, and calibration. It characterizes how semantic audio priors support localization-aware scene analysis under data, computation, and deployment constraints. The framework is developed through informed multi-stage Neural Architecture Search (NAS). Stage 1 shows that spectral FOA descriptors, based on magnitude, phase, and Intensity Vectors (IVs), provide the most reliable interface for semantic-to-spatial transfer. Stage 2 identifies early residual spatial encoding as the main capacity-sensitive component, while late track-wise abstraction and recurrent smoothing act mainly as refinement stages. Stage 3 shows that late cross-stitch coupling improves semantic-spatial interaction, whereas early fusion is costlier and less effective. Diagnostic evaluation analyzes the selected architecture under class balancing, focal loss, activity-conditioned DOA supervision, threshold calibration, and transfer across STARSS23, TAU2019, TAU-NIGENS2020, and TAU-NIGENS2021. Focal loss improves the activity point, active-only DOA supervision mitigates inactive-target dominance, and validation-selected thresholds recover calibration without replacing spatial learning. Cross-dataset and oracle-activity analyses indicate strong fixed-source localization on TAU2019, transferable representations from TAU-NIGENS2021, and meaningful but uncertain behavior on STARSS23. Overall, GP-AT priors appear promising for SELD design when embedded in spatial-aware architectures and optimized through integrated calibration and deployment-oriented strategies.
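
The cross-stitch coupling mentioned for Stage 3 follows the general idea of cross-stitch networks [14]: each stream's output is a weighted mix of both streams. The sketch below shows a single unit with a fixed 2×2 mixing matrix; in practice the weights are learned, and the matching feature shapes, the example weights, and the function name cross_stitch are assumptions made for illustration.

```python
import numpy as np

def cross_stitch(semantic, spatial, alpha):
    """Mix two equally shaped feature maps with a 2x2 matrix.

    alpha[0] weights the new semantic stream, alpha[1] the new spatial
    stream. The values are fixed here; a trained model would learn them.
    """
    mixed_semantic = alpha[0, 0] * semantic + alpha[0, 1] * spatial
    mixed_spatial = alpha[1, 0] * semantic + alpha[1, 1] * spatial
    return mixed_semantic, mixed_spatial

# Near-identity mixing: each stream mostly keeps its own features.
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
semantic = np.ones((5, 8))    # (frames, features), hypothetical shapes
spatial = np.zeros((5, 8))
new_semantic, new_spatial = cross_stitch(semantic, spatial, alpha)
print(new_semantic[0, 0], new_spatial[0, 0])
```

Placing such a unit late in the network, as Stage 3 favors, means the mix operates on already abstracted track-level features; an early unit would instead mix the streams while spatial features are still being formed.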

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the AT2SELD framework, which extends pretrained General-Purpose Audio Tagging (GP-AT) models to Sound Event Localization and Detection (SELD) by coupling them with compact First-Order Ambisonics (FOA) spatial processing, track-wise SED and Cartesian DOA estimation, permutation-aware supervision, and calibration. Developed via a three-stage Neural Architecture Search (NAS), the work identifies spectral FOA descriptors as the optimal interface for semantic-to-spatial transfer, early residual spatial encoding as key, and late cross-stitch coupling as beneficial. Diagnostic experiments evaluate focal loss, activity-conditioned supervision, threshold calibration, and cross-dataset transfer across STARSS23, TAU2019, TAU-NIGENS2020, and TAU-NIGENS2021, concluding that GP-AT priors are promising for SELD when embedded in spatial-aware architectures with integrated optimization strategies.

Significance. If the central claims hold, this work demonstrates a practical pathway for leveraging large pretrained audio tagging models in spatially grounded tasks like SELD, potentially improving data efficiency and performance under computational constraints. The multi-stage NAS provides a systematic empirical approach to architecture exploration, and the focus on calibration and deployment strategies addresses practical applicability. The cross-dataset analysis offers insights into transferability. Credit is due for the diagnostic evaluation of loss functions, supervision variants, and the integrated calibration approach.

major comments (2)
  1. [Stage 1 of the NAS analysis] The claim that spectral FOA descriptors based on magnitude, phase, and Intensity Vectors provide the most reliable interface for semantic-to-spatial knowledge transfer is load-bearing for the cross-dataset transfer results and the overall conclusion that GP-AT priors are promising. However, the reported results show strong fixed-source behavior on TAU2019, transferable representations only from TAU-NIGENS2021, and uncertain behavior on STARSS23. This pattern is consistent with the descriptors capturing dataset-specific spatial statistics rather than purely semantic features that transfer independently of source configuration or room acoustics.
  2. [Diagnostic evaluation] The manuscript reports that focal loss improves the activity point, active-only DOA supervision mitigates inactive-target dominance, and validation-selected thresholds recover calibration. However, without quantitative ablation tables or error analysis showing the magnitude of these improvements relative to standard baselines (e.g., cross-entropy or full DOA supervision), it is difficult to assess whether these components substantively support the central claim that the GP-AT priors are promising when embedded in the proposed architecture.
minor comments (1)
  1. [Stage 1-3 NAS descriptions] The delineation of the three NAS stages would benefit from explicit statements of the search space, selection criteria, and computational budget for each stage to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, with planned revisions where the concerns identify gaps in evidence or clarity.

Point-by-point responses
  1. Referee: [Stage 1 of the NAS analysis] The claim that spectral FOA descriptors based on magnitude, phase, and Intensity Vectors provide the most reliable interface for semantic-to-spatial knowledge transfer is load-bearing for the cross-dataset transfer results and the overall conclusion that GP-AT priors are promising. However, the reported results show strong fixed-source behavior on TAU2019, transferable representations only from TAU-NIGENS2021, and uncertain behavior on STARSS23. This pattern is consistent with the descriptors capturing dataset-specific spatial statistics rather than purely semantic features that transfer independently of source configuration or room acoustics.

    Authors: We agree that the cross-dataset patterns reported in the manuscript are consistent with partial capture of dataset-specific spatial statistics (particularly the strong fixed-source results on TAU2019). The manuscript already notes this variability, but the referee is correct that the load-bearing claim of a 'most reliable interface for semantic-to-spatial transfer' requires qualification. Stage 1 NAS still shows these descriptors outperforming alternatives, yet the transfer is not purely semantic. We will revise the relevant sections to explicitly discuss this limitation and adjust the strength of the conclusion accordingly.
    revision: partial

  2. Referee: [Diagnostic evaluation] The manuscript reports that focal loss improves the activity point, active-only DOA supervision mitigates inactive-target dominance, and validation-selected thresholds recover calibration. However, without quantitative ablation tables or error analysis showing the magnitude of these improvements relative to standard baselines (e.g., cross-entropy or full DOA supervision), it is difficult to assess whether these components substantively support the central claim that the GP-AT priors are promising when embedded in the proposed architecture.

    Authors: The referee correctly identifies that the diagnostic section relies on qualitative descriptions rather than quantitative ablations. We will add expanded ablation tables in the revised manuscript that report the magnitude of improvements (e.g., error rate deltas, DOA error reductions) for focal loss vs. cross-entropy, active-only vs. full DOA supervision, and calibrated vs. uncalibrated thresholds, each compared against the relevant baselines.
    revision: yes
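
For readers unfamiliar with the "active-only vs. full DOA supervision" contrast discussed in this exchange, the sketch below compares the two on a toy target where most track slots are inactive. The squared-error loss, the tensor shapes, and the function name doa_mse are assumptions for illustration; the numbers are synthetic and are not results from the paper.

```python
import numpy as np

def doa_mse(pred, target, activity, active_only=True):
    """DOA regression loss over track slots.

    pred, target: (frames, tracks, 3) Cartesian vectors.
    activity: (frames, tracks) with 1 for active slots, 0 otherwise.
    active_only=False averages over every slot, so inactive slots
    (zero targets) can dominate when activity is sparse.
    """
    sq_err = np.sum((pred - target) ** 2, axis=-1)   # (frames, tracks)
    if not active_only:
        return float(np.mean(sq_err))
    n_active = np.sum(activity)
    return float(np.sum(sq_err * activity) / max(n_active, 1))

# One active slot out of eight, predicted with a 90-degree error.
pred = np.zeros((4, 2, 3))
target = np.zeros((4, 2, 3))
activity = np.zeros((4, 2))
activity[0, 0] = 1
target[0, 0] = [1.0, 0.0, 0.0]
pred[0, 0] = [0.0, 1.0, 0.0]

full_loss = doa_mse(pred, target, activity, active_only=False)
active_loss = doa_mse(pred, target, activity, active_only=True)
print(full_loss, active_loss)
```

In this toy case the full loss is diluted by the seven trivially correct inactive slots, while the active-only loss reflects the localization error directly.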

Circularity Check

0 steps flagged

No significant circularity; empirical NAS-driven design with independent cross-dataset validation.

Full rationale

The paper reports results from a multi-stage Neural Architecture Search process that empirically identifies effective components (FOA descriptors in Stage 1, residual encoding in Stage 2, cross-stitch in Stage 3) for coupling pretrained GP-AT backbones with spatial SELD processing. These are presented as search outcomes evaluated on held-out data and cross-dataset transfers (TAU2019, TAU-NIGENS2021, STARSS23), not as closed-form predictions or definitions that reduce to their own inputs by construction. No equations, self-citations, or ansatzes are invoked in a load-bearing way that would create circularity. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that pretrained semantic audio representations transfer usefully to spatial localization when the right architectural interface is found; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • Domain assumption: Pretrained GP-AT models contain semantic priors that can support localization-aware scene analysis.
    Invoked as the starting point for the entire AT2SELD framework.
  • Domain assumption: Spectral FOA descriptors provide the most reliable interface for semantic-to-spatial transfer.
    Conclusion of Stage 1 NAS; treated as a discovered fact rather than proved.

pith-pipeline@v0.9.1-grok · 5861 in / 1338 out tokens · 44132 ms · 2026-06-29T03:18:38.131792+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    Audio Set: An Ontology and Human-Labeled Dataset for Audio Events,

    J.F. Gemmeke et al., “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE Press, 2017, pp. 776–780

  2. [2]

    From General-Purpose Audio Tagging to Real-Time Emergency Vehicle Siren Detection,

    S. Giacomelli et al., “From General-Purpose Audio Tagging to Real-Time Emergency Vehicle Siren Detection,” IEEE Transactions on Audio, Speech and Language Processing, pp. 1–16, 2026

  3. [3]

    Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks,

    S. Adavanne et al., “Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2019

  4. [4]

    Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network,

    S. Adavanne, A. Politis, and T. Virtanen, “Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019, pp. 20–24

  5. [5]

    Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications,

    D. Diaz-Guerra et al., “Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications,” in Proceedings of Forum Acusticum 2023, 2023, preprint also available as arXiv:2306.08510

  6. [6]

    The NERC-SLIP System for Sound Event Localization and Detection of DCASE2022 Challenge,

    Q. Wang et al., “The NERC-SLIP System for Sound Event Localization and Detection of DCASE2022 Challenge,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 2022

  7. [7]

    The Neural-SRP Method for Universal Robust Multi-Source Tracking,

    E. Grinstein et al., “The Neural-SRP Method for Universal Robust Multi-Source Tracking,” IEEE Open Journal of Signal Processing, vol. 5, pp. 19–38, 2024

  8. [8]

    Sound Event Detection: A Tutorial,

    A. Mesaros et al., “Sound Event Detection: A Tutorial,” IEEE Signal Processing Magazine, vol. 38, no. 5, pp. 67–83, 2021

  9. [9]

    ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection,

    K. Shimada et al., “ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 915–919

  10. [10]

    Multi-ACCDOA: Localizing and Detecting Overlapping Sounds From the Same Class With Auxiliary Duplicating Permutation Invariant Training,

    ——, “Multi-ACCDOA: Localizing and Detecting Overlapping Sounds From the Same Class With Auxiliary Duplicating Permutation Invariant Training,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 316–320

  11. [11]

    Event-Independent Network for Polyphonic Sound Event Localization and Detection,

    Y. Cao et al., “Event-Independent Network for Polyphonic Sound Event Localization and Detection,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 2020, pp. 112–116

  12. [12]

    An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection,

    ——, “An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 885–889

  13. [13]

    Attention Is All You Need,

    A. Vaswani et al., “Attention Is All You Need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000–6010

  14. [14]

    Cross-stitch networks for multi-task learning,

    I. Misra et al., “Cross-stitch networks for multi-task learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  15. [15]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations (ICLR), 2015, poster presentation, preprint available at arXiv 10.48550/arXiv.1412.6980

  16. [16]

    Early Stopping — But When?

    L. Prechelt, “Early Stopping — But When?” in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science, G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Berlin, Heidelberg: Springer, 2012, vol. 7700, pp. 53–67

  17. [17]

    TUT Sound Events 2018 - Ambisonic, Anechoic and Synthetic Impulse Response Dataset,

    S. Adavanne, A. Politis, and T. Virtanen, “TUT Sound Events 2018 - Ambisonic, Anechoic and Synthetic Impulse Response Dataset,” Apr. 2018

  18. [18]

    U Recurrent Neural Network for Polyphonic Sound Event Detection and Localization,

    L. Pi et al., “U Recurrent Neural Network for Polyphonic Sound Event Detection and Localization,” in Proceedings of the 2020 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2020, pp. 1–5

  19. [19]

    U-Net: Convolutional Networks for Biomedical Image Segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing, 2015, pp. 234–241

  20. [20]

    The Generalized Correlation Method for Estimation of Time Delay,

    C. Knapp and G. Carter, “The Generalized Correlation Method for Estimation of Time Delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976

  21. [21]

    Minimal Gated Unit for Recurrent Neural Networks,

    G. Zhou et al., “Minimal Gated Unit for Recurrent Neural Networks,” International Journal of Automation and Computing, vol. 13, no. 3, pp. 226–234, 2016

  22. [22]

    Resnet-Conformer Network using Multi-Scale Channel Attention for Sound Event Localization and Detection in Real Scenes,

    L. Xue et al., “Resnet-Conformer Network using Multi-Scale Channel Attention for Sound Event Localization and Detection in Real Scenes,” in 2023 IEEE 15th International Conference on Wireless Communications and Signal Processing (WCSP), 2023, pp. 1–6

  23. [23]

    ResNet-Conformer Network with Shared Weights and Attention Mechanism for Sound Event Localization, Detection, and Distance Estimation,

    Q. T. Vo and D. K. Han, “ResNet-Conformer Network with Shared Weights and Attention Mechanism for Sound Event Localization, Detection, and Distance Estimation,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), 2024

  24. [24]

    Deep residual learning for image recognition,

    K. He et al., “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778

  25. [25]

    Zhang et al., Dive into Deep Learning

    A. Zhang et al., Dive into Deep Learning. Cambridge University Press, 2023. [Online]. Available: https://d2l.ai/

  26. [26]

    Conformer: Convolution-augmented Transformer for Speech Recognition,

    A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of Interspeech 2020, 2020, pp. 5036–5040

  27. [27]

    STARSS22: A Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events,

    A. Politis et al., “STARSS22: A Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 2022, pp. 125–129

  28. [28]

    Starss23: Sony-tau realistic spatial soundscapes 2023,

    ——, “Starss23: Sony-tau realistic spatial soundscapes 2023,” Mar. 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7880637

  29. [29]

    ESC: Dataset for Environmental Sound Classification,

    K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015, pp. 1015–1018

  30. [30]

    FSD50K: An Open Dataset of Human-Labeled Sound Events,

    E. Fonseca et al., “FSD50K: An Open Dataset of Human-Labeled Sound Events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2022

  31. [31]

    TAU-NIGENS Spatial Sound Events 2021: A Synthetic Spatial Sound Events Dataset,

    A. Politis et al., “TAU-NIGENS Spatial Sound Events 2021: A Synthetic Spatial Sound Events Dataset,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, pp. 189–193

  32. [32]

    AugMix: A Simple Method to Improve Robustness and Uncertainty under Data Shift,

    D. Hendrycks et al., “AugMix: A Simple Method to Improve Robustness and Uncertainty under Data Shift,” in International Conference on Learning Representations, 2020

  33. [33]

    SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,

    D. S. Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proceedings of Interspeech 2019, 2019, pp. 2613–2617

  34. [34]

    SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform,

    Y. He, N. Trigoni, and A. Markham, “SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021, pp. 4160–4170, preprint available at arXiv:2106.06969

  35. [35]

    SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms,

    Y. He and A. Markham, “SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms,” in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2022, pp. 2408–2412

  36. [36]

    Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers,

    S. Adavanne, A. Politis, and T. Virtanen, “Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021, pp. 1–5, preprint available at arXiv:2111.00030

  37. [37]

    Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks,

    D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 300–311, 2021

  38. [38]

    E-PANNs: Sound recognition using efficient pre-trained audio neural networks,

    A. Singh, H. Liu, and M. D. Plumbley, “E-PANNs: Sound recognition using efficient pre-trained audio neural networks,” in Inter-Noise and Noise-Con Congress and Conference Proceedings, vol. 268, no. 1. Institute of Noise Control Engineering, 2023, pp. 7220–7228

  39. [39]

    Efficient CNNs via Passive Filter Pruning,

    A. Singh and M. D. Plumbley, “Efficient CNNs via Passive Filter Pruning,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 33, pp. 1763–1774, 2025

  40. [40]

    Neural Architecture Search: A Survey,

    T. Elsken, J. H. Metzen, and F. Hutter, “Neural Architecture Search: A Survey,” Journal of Machine Learning Research, vol. 20, no. 55, pp. 1–21, 2019

  41. [41]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    A. G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” preprint arXiv:1704.04861, 2017

  42. [42]

    Focal Loss for Dense Object Detection,

    T. Lin et al., “Focal Loss for Dense Object Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020.

  43. [43]

    Decoupled Weight Decay Regularization,

    I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7

  44. [44]

    SGDR: Stochastic Gradient Descent with Warm Restarts,

    ——, “SGDR: Stochastic Gradient Descent with Warm Restarts,” in International Conference on Learning Representations (ICLR), 2017. [Online]. Available: https://openreview.net/forum?id=Skq89Scxx

  45. [45]

    STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events,

    K. Shimada et al., “STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3488–3500, 2023

  46. [46]

    A Multi-Room Reverberant Dataset for Sound Event Localization and Detection,

    S. Adavanne, A. Politis, and T. Virtanen, “A Multi-Room Reverberant Dataset for Sound Event Localization and Detection,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA, 2019

  47. [47]

    A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection,

    A. Politis, S. Adavanne, and T. Virtanen, “A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Challenge, 2020. [Online]. Available: https://dcase.community/documents/challenge2020/technical_reports/DCASE2020_Po...

  48. [48]

    A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection,

    Q. Wang et al., “A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1251–1264, 2023

  49. [49]

    Paszke et al., PyTorch: an imperative style, high-performance deep learning library

    A. Paszke et al., PyTorch: an imperative style, high-performance deep learning library. Red Hook, NY, USA: Curran Associates Inc., 2019

  50. [50]

    An Optimal Algorithm for Selection in a Min-Heap,

    G. N. Frederickson, “An Optimal Algorithm for Selection in a Min-Heap,” Information and Computation, vol. 104, no. 2, pp. 197–214, Jun. 1993.