arxiv: 2106.06969 · v2 · pith:ALN2OQHOnew · submitted 2021-06-13 · 💻 cs.SD · cs.LG · eess.AS

SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform

classification 💻 cs.SD · cs.LG · eess.AS
keywords sounddet · event · detection · sound · temporal · waveform · localization · proposal

We present SoundDet, an end-to-end trainable and lightweight framework for polyphonic moving sound event detection and localization. Prior methods typically approach this problem by first converting the raw waveform into time-frequency representations, which are more amenable to processing with well-established image processing pipelines. Prior methods also detect in a segment-wise manner, leading to incomplete and partial detections. SoundDet takes a novel approach: it directly consumes the raw, multichannel waveform and treats the spatio-temporal sound event as a complete "sound-object" to be detected. Specifically, SoundDet consists of a backbone neural network and two parallel heads for temporal detection and spatial localization, respectively. Given the high sampling rate of raw waveform, the backbone network first learns a bank of phase-sensitive and frequency-selective filters to explicitly retain direction-of-arrival information, while being more computationally and parametrically efficient than standard 1D/2D convolution. A dense sound event proposal map is then constructed to handle the challenge of predicting events with widely varying temporal durations. Accompanying the dense proposal map are a temporal overlapness map and a motion smoothness map, which measure a proposal's confidence of being an event from the perspectives of temporal detection accuracy and movement consistency. Including the two maps ensures that SoundDet is trained in a spatio-temporally unified manner. Experimental results on the public DCASE dataset show the advantage of SoundDet under both segment-based evaluation and our newly proposed event-based evaluation system.
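
The abstract does not define the dense proposal map or the temporal overlapness map precisely, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes that proposals are half-open frame intervals [start, end), that every valid (start, end) pair is a proposal, and that overlapness can be approximated by temporal intersection-over-union (IoU) with the best-matching ground-truth event. The function names `temporal_iou`, `dense_proposals`, and `overlapness_map` are hypothetical.

```python
def temporal_iou(a_start, a_end, b_start, b_end):
    """Temporal IoU of two half-open intervals [start, end), in frames."""
    inter = max(0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def dense_proposals(num_frames):
    """Enumerate every (start, end) pair with 0 <= start < end <= num_frames."""
    return [(s, e)
            for s in range(num_frames)
            for e in range(s + 1, num_frames + 1)]


def overlapness_map(num_frames, gt_events):
    """Score each proposal by its best IoU against any ground-truth event.

    Assumption: IoU stands in for the paper's overlapness measure.
    gt_events is a list of (start, end) frame intervals.
    """
    return {
        (s, e): max((temporal_iou(s, e, gs, ge) for gs, ge in gt_events),
                    default=0.0)
        for s, e in dense_proposals(num_frames)
    }


if __name__ == "__main__":
    scores = overlapness_map(4, [(1, 3)])
    # Print proposals from best to worst match with the single event (1, 3).
    for proposal, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(proposal, round(score, 3))
```

Enumerating all start/end pairs lets short and long events be scored on the same grid, which is one plausible reading of how a dense map copes with widely varying durations. The number of proposals grows quadratically with the number of frames, so a practical system would work at a coarse temporal resolution.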

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From General-Purpose Audio Tagging to Spatially Grounded Sound Event Localization and Detection

    cs.SD · 2026-06 · unverdicted · novelty 4.0

    AT2SELD extends pretrained audio tagging backbones to SELD via FOA descriptors, track-wise processing, permutation-aware supervision, and staged NAS on multiple datasets.