Towards Open World Sound Event Detection
Pith reviewed 2026-05-07 12:34 UTC · model grok-4.3
The pith
Sound event detection can now identify unknown events and learn from them incrementally.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To address the challenges of overlapping and ambiguous events, we develop the Open-World Deformable Sound Event Detection Transformer (WOOT) framework, which incorporates a 1D Deformable architecture with deformable attention, feature disentanglement to separate class-specific and class-agnostic representations, a one-to-many matching strategy, and a diversity loss. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
What carries the argument
The WOOT framework built on a 1D Deformable architecture that uses deformable attention to adaptively focus on salient temporal regions, combined with feature disentanglement, one-to-many matching, and diversity loss to manage novel and overlapping sounds.
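The paper publishes no implementation, but a minimal single-head sketch of what 1D deformable attention over a temporal feature sequence could look like follows (PyTorch; all names, shapes, and the single-head simplification are assumptions, not the authors' code):

```python
# Minimal single-head sketch of 1D deformable attention (assumed, not the
# authors' code): each query predicts a few fractional temporal offsets
# around a reference point, samples features there by linear interpolation,
# and aggregates them with learned attention weights.
import torch
import torch.nn as nn

class DeformableAttention1D(nn.Module):
    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Linear(dim, n_points)   # fractional frame offsets
        self.weight_head = nn.Linear(dim, n_points)   # per-point attention logits
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, features, ref_points):
        # queries: (B, Q, D); features: (B, T, D); ref_points: (B, Q) in [0, 1]
        B, T, D = features.shape
        values = self.value_proj(features)                    # (B, T, D)
        offsets = self.offset_head(queries)                   # (B, Q, K)
        attn = self.weight_head(queries).softmax(dim=-1)      # (B, Q, K)
        # Absolute fractional sampling locations, clamped to the sequence.
        loc = (ref_points.unsqueeze(-1) * (T - 1) + offsets).clamp(0, T - 1)
        lo = loc.floor().long()                               # left neighbour index
        hi = (lo + 1).clamp(max=T - 1)                        # right neighbour index
        frac = (loc - lo.float()).unsqueeze(-1)               # (B, Q, K, 1)

        def gather_frames(idx):                               # idx: (B, Q, K)
            expanded = values.unsqueeze(1).expand(B, idx.size(1), T, D)
            return expanded.gather(2, idx.unsqueeze(-1).expand(-1, -1, -1, D))

        # Linear interpolation between the two nearest frames, then weighted sum.
        sampled = (1 - frac) * gather_frames(lo) + frac * gather_frames(hi)
        return self.out_proj((attn.unsqueeze(-1) * sampled).sum(dim=2))
```

In the Deformable DETR family this mechanism also operates per head and per feature level; the sketch collapses both to one for brevity.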
If this is right
- Known sound events are detected with accuracy at or above existing closed-world leaders.
- Unseen events are identified as novel instead of forced into known classes.
- New event classes can be added incrementally from the identified unknowns.
- Overlapping and ambiguous events are handled more effectively than standard approaches.
- The approach becomes practical for dynamic real-world audio environments.
Where Pith is reading between the lines
- The same separation of known and novel features could extend to other time-series tasks such as video action detection in open settings.
- Lower dependence on exhaustive prior labels might enable longer-term deployment in settings where new sounds appear gradually.
- Pairing the audio framework with visual open-world methods could produce joint multimodal systems that handle unexpected events across senses.
Load-bearing premise
The 1D deformable components, together with the disentanglement and diversity mechanisms, can separate and learn novel events even when audio events overlap or are acoustically ambiguous.
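The paper's formulation of these mechanisms is not given; one plausible toy reading (hypothetical names throughout) splits each query embedding into class-specific and class-agnostic halves, with a diversity loss that penalises similarity among the class-agnostic parts:

```python
# Toy reading of feature disentanglement plus a diversity loss (assumed;
# the paper's exact formulation is not published).
import torch
import torch.nn.functional as F

def disentangle(query_emb: torch.Tensor):
    """Split each query embedding (B, Q, D) into class-specific and
    class-agnostic halves along the feature dimension."""
    d = query_emb.size(-1) // 2
    return query_emb[..., :d], query_emb[..., d:]

def diversity_loss(class_agnostic: torch.Tensor) -> torch.Tensor:
    """Penalise positive pairwise cosine similarity among class-agnostic
    query features, pushing queries to cover distinct event patterns."""
    z = F.normalize(class_agnostic, dim=-1)            # (B, Q, d)
    sim = torch.bmm(z, z.transpose(1, 2))              # (B, Q, Q) cosine matrix
    eye = torch.eye(sim.size(-1), device=sim.device)
    return (sim - eye).clamp(min=0).mean()             # ignore self-similarity
```

Under this reading, overlapping events would tend to land on different queries because similar class-agnostic features are explicitly penalised.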
What would settle it
A dataset of many overlapping novel events where the model shows no gain over baselines in correctly identifying unknowns or incorporating them without accuracy loss.
read the original abstract
Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Open-World Sound Event Detection (OW-SED) paradigm to extend conventional closed-world SED systems, enabling detection of known events, identification of novel ones, and incremental learning. It proposes the WOOT framework built on a 1D Deformable architecture with deformable attention, feature disentanglement for class-specific/agnostic representations, one-to-many matching, and a diversity loss to address overlapping/ambiguous events.
Significance. If the performance claims hold, this would represent a meaningful step toward practical audio understanding in dynamic real-world settings such as surveillance and smart environments by adapting open-world ideas from vision to audio. The specific architectural choices for handling temporal salience and representation diversity are well-motivated, though their impact requires empirical substantiation.
major comments (2)
- [Abstract] The central performance claims (marginally superior closed-world results and significant open-world gains over baselines) are asserted without any quantitative metrics, dataset specifications, ablation studies, or tables/figures showing results. This makes it impossible to evaluate whether the 1D Deformable components, feature disentanglement, one-to-many matching, or diversity loss actually resolve overlapping events or enable incremental learning as stated.
- [Methods] No equations, derivations, or formal definitions of the loss terms, matching strategy, or deformable attention mechanism are provided. Without these, it is unclear how the audio-specific adaptations reduce to concrete, non-circular implementations or differ from standard deformable attention in a way that addresses the stated challenges.
minor comments (1)
- [Introduction] The abstract and introduction repeat the list of proposed components without clarifying their individual contributions or interactions; a dedicated paragraph or diagram would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the potential significance of the OW-SED paradigm and WOOT framework. We agree that the manuscript requires major revision to strengthen the presentation of results and methodological details, and we commit to addressing all points raised.
read point-by-point responses
-
Referee: [Abstract] The central performance claims (marginally superior closed-world results and significant open-world gains over baselines) are asserted without any quantitative metrics, dataset specifications, ablation studies, or tables/figures showing results. This makes it impossible to evaluate whether the 1D Deformable components, feature disentanglement, one-to-many matching, or diversity loss actually resolve overlapping events or enable incremental learning as stated.
Authors: We agree that the abstract would be strengthened by including concrete quantitative metrics and references to supporting material. In the revised version, we will update the abstract to report key performance numbers (e.g., closed-world mAP and open-world novel-event detection rates on the evaluated datasets) together with explicit pointers to the corresponding tables, figures, and ablation studies. This change will allow readers to directly assess the contributions of the 1D deformable attention, feature disentanglement, one-to-many matching, and diversity loss. revision: yes
-
Referee: [Methods] No equations, derivations, or formal definitions of the loss terms, matching strategy, or deformable attention mechanism are provided. Without these, it is unclear how the audio-specific adaptations reduce to concrete, non-circular implementations or differ from standard deformable attention in a way that addresses the stated challenges.
Authors: We acknowledge that the current manuscript does not supply explicit equations or formal definitions for the loss terms, the one-to-many matching strategy, or the 1D deformable attention mechanism. In the revision we will insert a new subsection in Methods that provides the full mathematical formulations, including the diversity loss, the matching cost, and the deformable attention operator with its audio-specific adaptations (temporal sampling offsets and feature disentanglement). Derivations and implementation details will be added to clarify how these components differ from standard deformable attention and how they mitigate overlapping and ambiguous events. revision: yes
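For illustration only, generic DETR-style forms of the promised quantities might look as follows (assumed here, not taken from the paper):

```latex
% Assumed, generic DETR-style forms; not the paper's actual equations.
% One-to-many matching: ground-truth event y_j = (c_j, t_j) may be assigned
% to several queries \hat{y}_i = (\hat{c}_i, \hat{t}_i) that minimise
\[
\mathcal{C}(y_j, \hat{y}_i) =
  \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{cls}}(c_j, \hat{c}_i)
  + \lambda_{\mathrm{loc}}\,\lVert t_j - \hat{t}_i \rVert_1 .
\]
% A diversity loss over query embeddings z_i could take the form
\[
\mathcal{L}_{\mathrm{div}} =
  \frac{1}{Q(Q-1)} \sum_{i \neq k}
  \max\!\Bigl(0,\; \frac{z_i^{\top} z_k}{\lVert z_i \rVert\,\lVert z_k \rVert}\Bigr).
\]
```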
Circularity Check
No circularity in derivation chain
full rationale
The paper proposes the OW-SED paradigm and WOOT framework as an architectural combination of 1D deformable attention, feature disentanglement, one-to-many matching, and diversity loss, with performance claims resting solely on experimental comparisons to baselines. No equations, mathematical derivations, fitted parameters presented as predictions, or self-referential definitions appear in the text. The approach is described as inspired by external computer-vision open-world methods but adapted for audio without any reduction of claims to inputs by construction, self-citation load-bearing arguments, or renaming of known results. The central assertions remain empirical and self-contained.