Towards Open World Sound Event Detection

L.H.Son; L.T.Minh; P.H.Hai

arXiv: 2605.03934 · v2 · pith:EFDHBLNAnew · submitted 2026-05-05 · cs.SD · cs.AI

Pith reviewed 2026-05-22 10:39 UTC · model grok-4.3

keywords open-world sound event detection · deformable attention · transformer · feature disentanglement · audio event detection · incremental learning · unknown event identification

The pith

The WOOT framework detects known sound events while identifying and learning from unseen ones in real-world audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an open-world paradigm for sound event detection that moves beyond closed sets of known classes. It introduces the WOOT transformer that uses deformable attention to focus on relevant time segments and applies feature disentanglement plus a diversity loss to separate class-specific details from general audio patterns. This setup lets the model flag novel events, match them in a one-to-many way, and support incremental updates. A sympathetic reader cares because surveillance, smart cities, and healthcare systems encounter unexpected sounds that standard detectors simply miss or misclassify. If the approach holds, audio models become more practical without requiring exhaustive pre-labeling of every possible event.

Core claim

We introduce the Open-World Sound Event Detection (OW-SED) paradigm together with the Open-World Deformable Sound Event Detection Transformer (WOOT). The framework combines a 1D deformable architecture for adaptive temporal focus, feature disentanglement to isolate class-specific from class-agnostic representations, a one-to-many matching strategy, and a diversity loss. Experiments show the method performs marginally better than leading techniques under closed-world conditions and significantly outperforms baselines when novel events appear.

What carries the argument

The 1D Deformable architecture inside WOOT, which uses deformable attention together with feature disentanglement and diversity loss to adaptively select temporal regions and separate representations for known versus unknown events.
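
As a rough illustration of what "adaptively select temporal regions" can mean, the sketch below implements single-query, single-head 1D deformable attention in plain Python. It is not the paper's implementation: the function names, the clamping at sequence boundaries, and the hand-picked offsets and scores (which a real model would predict from the query) are assumptions made for illustration.

```python
import math

def sample_linear(values, pos):
    """Linearly interpolate a sequence of feature vectors at a fractional frame index.
    Positions outside the sequence are clamped to the first or last frame (an assumption)."""
    pos = min(max(pos, 0.0), len(values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return [(1 - frac) * a + frac * b for a, b in zip(values[lo], values[hi])]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_attention_1d(values, ref_point, offsets, logits):
    """Attend to K sampled positions around a reference point instead of all T frames.

    values:    T feature vectors (list of lists)
    ref_point: normalized reference position in [0, 1]
    offsets:   K sampling offsets in frames (hypothetical; normally predicted from the query)
    logits:    K unnormalized attention scores (hypothetical; normally predicted from the query)
    """
    weights = softmax(logits)
    center = ref_point * (len(values) - 1)
    out = [0.0] * len(values[0])
    for off, w in zip(offsets, weights):
        sampled = sample_linear(values, center + off)
        out = [o + w * s for o, s in zip(out, sampled)]
    return out

# Toy input: frame t carries the feature vector [t, 1].
toy_values = [[float(t), 1.0] for t in range(5)]
toy_output = deformable_attention_1d(
    toy_values, ref_point=0.5, offsets=[-1.0, 0.5], logits=[0.0, 0.0]
)
```

The cost per query scales with the number of sampled points K rather than with the sequence length, which is the usual motivation for deformable attention on long inputs.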

If this is right

Sound detection systems can now flag and later incorporate previously unseen acoustic events without full retraining.
Performance remains competitive or slightly better when all events are known in advance.
Real-world applications such as surveillance and healthcare gain robustness to the natural emergence of new sounds.
One-to-many matching plus diversity loss reduces collapse of representations for similar or ambiguous audio.
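
The summary does not spell out how WOOT's one-to-many matching works, so the following is only a generic sketch of the idea: each ground-truth event may receive several predicted segments as positives instead of exactly one. The IoU-based rule, the `k` and `min_iou` parameters, and all function names are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection over union of two (onset, offset) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def one_to_many_match(predictions, ground_truths, k=2, min_iou=0.3):
    """Assign up to k predicted segments to each ground-truth event.

    Every prediction is used at most once: it goes to the ground truth it
    overlaps most, provided the overlap reaches min_iou.
    Returns a dict {ground_truth_index: [prediction indices, best first]}.
    """
    candidates = {}
    for p_idx, pred in enumerate(predictions):
        ious = [temporal_iou(pred, gt) for gt in ground_truths]
        if not ious:
            continue
        g_idx = max(range(len(ious)), key=ious.__getitem__)
        if ious[g_idx] >= min_iou:
            candidates.setdefault(g_idx, []).append((ious[g_idx], p_idx))
    # Keep only the k best-overlapping predictions per ground truth.
    return {
        g_idx: [p_idx for _, p_idx in sorted(pairs, reverse=True)[:k]]
        for g_idx, pairs in candidates.items()
    }

# Toy example: three predictions crowd around the first event, one hits the second.
toy_predictions = [(0.0, 1.0), (0.1, 1.1), (0.2, 0.9), (5.0, 6.0), (9.0, 9.5)]
toy_ground_truths = [(0.0, 1.0), (5.0, 6.2)]
toy_matches = one_to_many_match(toy_predictions, toy_ground_truths)
```

With `k=1` this reduces to a greedy one-to-one assignment, which makes the extra supervision signal from `k > 1` easy to ablate.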

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of specific and agnostic features could be tested on speech or music tasks where new categories appear over time.
Pairing the audio model with visual open-world detectors might improve joint scene understanding in multimodal settings.
If the diversity loss proves general, it could reduce the amount of labeled data needed when adapting to new acoustic domains.

Load-bearing premise

Deformable attention combined with feature disentanglement and diversity loss can reliably separate class-specific from class-agnostic features and manage overlapping or ambiguous events without extra supervision.

What would settle it

Run the system on a test set containing many overlapping novel sounds never seen in training; if it shows no clear gain over standard baselines in unknown-event recall or produces frequent false positives on known classes, the central claim fails.
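
One way to score such a test is sketched below as unknown-event recall: the fraction of reference events labeled as novel that are covered by a predicted "unknown" segment. The event tuple layout, the `"unknown"` label string, and the 0.5 IoU threshold are assumptions for illustration, not the paper's evaluation protocol.

```python
def temporal_iou(a_on, a_off, b_on, b_off):
    """Intersection over union of two time intervals given as onsets and offsets."""
    inter = max(0.0, min(a_off, b_off) - max(a_on, b_on))
    union = (a_off - a_on) + (b_off - b_on) - inter
    return inter / union if union > 0 else 0.0

def unknown_recall(predicted, reference, iou_threshold=0.5, unknown_label="unknown"):
    """Fraction of reference unknown events matched by a predicted unknown event.

    Events are (label, onset, offset) tuples. Returns None when the reference
    contains no unknown events, since recall is undefined in that case.
    """
    targets = [e for e in reference if e[0] == unknown_label]
    if not targets:
        return None
    candidates = [e for e in predicted if e[0] == unknown_label]
    hits = sum(
        1
        for _, on, off in targets
        if any(
            temporal_iou(on, off, p_on, p_off) >= iou_threshold
            for _, p_on, p_off in candidates
        )
    )
    return hits / len(targets)

# Toy clip: two novel events, one of which the detector mislabels as "dog".
toy_reference = [("unknown", 0.0, 2.0), ("unknown", 5.0, 6.0), ("dog", 3.0, 4.0)]
toy_predicted = [("unknown", 0.1, 2.0), ("dog", 5.0, 6.0), ("dog", 3.0, 4.0)]
toy_recall = unknown_recall(toy_predicted, toy_reference)
```

A complete check would pair this with a false-positive count on known classes, since the test described above fails on either criterion.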

Figures

Figures reproduced from arXiv: 2605.03934 by L.H.Son, L.T.Minh, P.H.Hai.

**Figure 1.** Introduction to the Open-World Sound Event Detection (OW-SED) task

**Figure 2.** Illustration of the WOOT model architecture. The proposed WOOT is built upon a 1D Deformable Transformer backbone specifically tailored for sound event detection. It introduces a transformer encoder with 1D Deformable Self-Attention (1D-DSA) to enable efficient temporal modeling, a decoder with 1D Deformable Cross-Attention (1D-DCA) to progressively refine event representations, and a specialized predicti…

**Figure 3.** Visualization of the outputs from PROB and our framework compared with the…

Original abstract

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adapts open-world detection from vision to sound events via a 1D deformable transformer and feature disentanglement, but the open-world gains lack direct checks that the separation actually works under overlap.

The main takeaway is that the authors define an OW-SED setting and build WOOT around deformable attention plus a split into class-specific and class-agnostic branches, with one-to-many matching and a diversity loss. That combination is the concrete new piece relative to prior closed-world SED transformers. They report the base model stays competitive in closed settings and pulls ahead when novel events appear, which at least shows the architecture does not collapse on standard tasks. Credit is due for spelling out the temporal overlap problem and trying to handle it without extra labels. The approach is a straightforward transplant of vision ideas rather than a wholly new paradigm, but the transplant itself is executed with some care for audio specifics.

The soft spot is exactly where the stress-test points: the headline open-world improvement rests on the claim that disentanglement plus diversity loss cleanly isolates known-class features from background and novel sounds. No ablation removes the disentanglement term, no t-SNE or mutual-information numbers are shown for the two branches, and the abstract gives no quantitative evidence that the agnostic branch stays free of class leakage when events overlap. Without those checks the performance delta could come from better overall modeling rather than the intended separation.

This is the sort of paper that belongs in a reading group for people already working on audio transformers or open-set detection; it gives them a concrete architecture to try and a clear list of missing controls to add. It is not yet ready for citation as a solved result, but the direction is worth referee time because the problem is real and the proposed components are testable. I would send it out for review with a request for the missing ablations and separation metrics.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Open-World Sound Event Detection (OW-SED) paradigm, in which models must detect known events, identify unseen ones, and support incremental learning. It proposes the WOOT framework built on a 1D Deformable architecture that employs deformable attention, feature disentanglement to isolate class-specific versus class-agnostic representations, one-to-many matching, and a diversity loss. Experiments are reported to show marginally superior closed-world performance and substantially better open-world results relative to existing baselines.

Significance. If the claimed gains prove robust, the work would meaningfully extend sound event detection beyond closed-world assumptions, addressing practical challenges such as novel events and temporal overlaps that arise in surveillance, smart-city, and healthcare applications. The explicit formulation of an OW-SED task and the architectural adaptations for audio are timely given parallel progress in open-world vision.

major comments (3)

§3.3 (Feature Disentanglement): the manuscript describes the separation of class-specific and class-agnostic branches but provides no quantitative diagnostic (class-conditional mutual information, branch-wise t-SNE separation scores, or an ablation that removes the disentanglement term) to confirm that the class-agnostic branch reliably captures only background and overlap content. Without such verification the central claim that the architecture handles ambiguous or overlapping events without extra supervision remains untested.
§5.2 and Table 3 (Open-world results): performance improvements are stated without error bars across random seeds, without statistical significance tests, and without an ablation that isolates the contribution of the diversity loss or one-to-many matching under controlled overlap conditions. These omissions make it impossible to attribute the reported gains specifically to the proposed mechanisms rather than to other implementation details.
§4.1 (Deformable attention): the claim that deformable attention adaptively focuses on salient temporal regions for overlapping events is not supported by any targeted analysis (e.g., attention-map visualizations on synthetic overlap mixtures or comparison against standard multi-head attention on the same mixtures). This analysis is load-bearing for the assertion that the architecture is particularly suited to OW-SED.

minor comments (2)

Abstract: the phrases 'marginally superior' and 'significantly improves' should be replaced by concrete metric deltas (e.g., +1.2% mAP) so readers can immediately gauge the magnitude of the reported gains.
Notation in §3.1: the symbols for the class-agnostic and class-specific feature tensors are introduced without an explicit dimension table; adding a short table of tensor shapes would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address each of the major comments below, providing clarifications and committing to revisions where appropriate to enhance the rigor of our claims.

Referee: §3.3 (Feature Disentanglement): the manuscript describes the separation of class-specific and class-agnostic branches but provides no quantitative diagnostic (class-conditional mutual information, branch-wise t-SNE separation scores, or an ablation that removes the disentanglement term) to confirm that the class-agnostic branch reliably captures only background and overlap content. Without such verification the central claim that the architecture handles ambiguous or overlapping events without extra supervision remains untested.

Authors: We agree that quantitative verification of the feature disentanglement would strengthen the manuscript. In the revised version, we will add an ablation study that removes the disentanglement loss and reports the impact on performance. Additionally, we will compute and report class-conditional mutual information between the class-specific and class-agnostic branches, as well as include t-SNE visualizations to demonstrate the separation of representations. This will provide evidence that the class-agnostic branch captures background and overlap information.
Revision: yes

Referee: §5.2 and Table 3 (Open-world results): performance improvements are stated without error bars across random seeds, without statistical significance tests, and without an ablation that isolates the contribution of the diversity loss or one-to-many matching under controlled overlap conditions. These omissions make it impossible to attribute the reported gains specifically to the proposed mechanisms rather than to other implementation details.

Authors: We acknowledge the importance of statistical rigor and ablations. We will rerun all experiments with multiple random seeds (e.g., 5 seeds) and report mean performance with standard deviations in the revised tables. We will also perform and include ablations that isolate the effects of the diversity loss and the one-to-many matching strategy, particularly under varying overlap conditions. Statistical significance tests (e.g., paired t-tests) will be added to support the improvements.
Revision: yes

Referee: §4.1 (Deformable attention): the claim that deformable attention adaptively focuses on salient temporal regions for overlapping events is not supported by any targeted analysis (e.g., attention-map visualizations on synthetic overlap mixtures or comparison against standard multi-head attention on the same mixtures). This analysis is load-bearing for the assertion that the architecture is particularly suited to OW-SED.

Authors: We recognize that targeted analysis is needed to substantiate the benefits of deformable attention in handling overlaps. In the revision, we will include visualizations of attention maps on synthetic audio mixtures with controlled overlaps. We will also provide a direct comparison of deformable attention versus standard multi-head attention on these mixtures, quantifying the focus on salient regions and the resulting detection performance. This will demonstrate the suitability of the architecture for OW-SED.
Revision: yes
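
The seed-averaged reporting and paired t-test promised in the response on §5.2 and Table 3 could be computed along the following lines. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the per-seed scores are made-up placeholders, not results from the paper.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(baseline, proposed):
    """t statistic for paired samples: the same seeds are run with both methods.

    Degrees of freedom are len(baseline) - 1; the p-value would come from a
    t table or a statistics library. Undefined if all differences are identical.
    """
    diffs = [p - b for b, p in zip(baseline, proposed)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error

# Placeholder scores for 5 seeds (illustrative values only).
baseline_scores = [0.50, 0.52, 0.51, 0.49, 0.53]
proposed_scores = [0.55, 0.56, 0.54, 0.55, 0.57]

baseline_mean, baseline_std = summarize(baseline_scores)
proposed_mean, proposed_std = summarize(proposed_scores)
t_value = paired_t_statistic(baseline_scores, proposed_scores)
```

Pairing by seed matters here: it removes seed-to-seed variance that both methods share, so a small but consistent gain can still register as significant.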

Circularity Check

0 steps flagged

WOOT framework claims rest on experimental outcomes with no self-referential derivations

The paper introduces the OW-SED paradigm and WOOT architecture via architectural choices (deformable attention, feature disentanglement, diversity loss, one-to-many matching) whose performance is reported as empirical results on benchmarks rather than any closed-form derivation or fitted parameter renamed as a prediction. No equations appear in the provided text that define a quantity in terms of itself or reduce a claimed result to a self-citation chain. The central claims are therefore self-contained against external datasets and baselines; the separation of class-specific versus class-agnostic features is presented as an empirical outcome of the proposed losses, not as a definitional identity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities; model components such as diversity loss are mentioned but not formalized.

pith-pipeline@v0.9.0 · 5720 in / 1010 out tokens · 31977 ms · 2026-05-22T10:39:19.104059+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions... feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Ldis = 1/N Σ |q_agn · q_spec| / (||q_agn|| ||q_spec||)
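
Read literally, the quoted expression for Ldis is the mean absolute cosine similarity between class-agnostic and class-specific embeddings. The sketch below computes it in plain Python, assuming the sum runs over N paired query embeddings; that pairing, the function name, and the zero-norm handling are assumptions, since the quoted formula carries no indices.

```python
import math

def disentanglement_loss(q_agn, q_spec):
    """Mean absolute cosine similarity between paired embeddings.

    q_agn, q_spec: lists of N vectors (class-agnostic and class-specific).
    The loss is 0 when every pair is orthogonal and 1 when every pair is
    parallel or anti-parallel. Pairs with a zero-norm vector contribute 0.
    """
    total = 0.0
    for agn, spec in zip(q_agn, q_spec):
        dot = sum(a * s for a, s in zip(agn, spec))
        norm = math.sqrt(sum(a * a for a in agn)) * math.sqrt(sum(s * s for s in spec))
        total += abs(dot) / norm if norm > 0 else 0.0
    return total / len(q_agn)

# Toy example: one orthogonal pair and one parallel pair.
toy_agn = [[1.0, 0.0], [1.0, 1.0]]
toy_spec = [[0.0, 1.0], [2.0, 2.0]]
toy_loss = disentanglement_loss(toy_agn, toy_spec)
```

Minimizing this quantity pushes the two branches toward orthogonal directions, which is a weaker condition than statistical independence; that gap is why the referee asks for mutual-information diagnostics.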

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Crocco, M

M. Crocco, M. Cristani, A. Trucco, V. Murino, Audio surveillance: A systematic review, ACM Computing Surveys (CSUR) 48 (4) (2016) 1– 46

work page 2016

[2] [2]

Salamon, J

J. Salamon, J. Bello, C. Silva, O. Nov, R. DuBois, A. Arora, C. Mydlarz, H. Doraiswamy, Sonyc: A system for the monitoring analysis and miti- gation of urban noise pollution, Communications of the ACM 5 (2018)

work page 2018

[3] [3]

N. C. Phuong, T. Do Dat, Sound classification for event detection: Ap- plication into medical telemonitoring, in: 2013 International Conference on Computing, Management and Telecommunications (ComManTel), IEEE, 2013, pp. 330–333. 27

work page 2013

[4] [4]

Hershey, S

S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al., Cnn archi- tectures for large-scale audio classification, in: 2017 ieee international conference on acoustics, speech and signal processing (icassp), IEEE, 2017, pp. 131–135

work page 2017

[5] [5]

Adavanne, P

S. Adavanne, P. Pertilä, T. Virtanen, Sound event detection using spatial features and convolutional recurrent neural network, in: 2017 IEEE international conference on acoustics, speech and signal process- ing (ICASSP), IEEE, 2017, pp. 771–775

work page 2017

[6] [6]

Z. Ye, X. Wang, H. Liu, Y. Qian, R. Tao, L. Yan, K. Ouchi, Sound event detectiontransformer: Anevent-basedend-to-endmodelforsoundevent detection, arXiv preprint arXiv:2110.02011 (2021)

work page arXiv 2021

[7] [7]

Zhang, I

H. Zhang, I. McLoughlin, Y. Song, Robust sound event recognition using convolutional neural networks, in: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2015, pp. 559–563

work page 2015

[8] [8]

Y.Li, M.Liu, K.Drossos, T.Virtanen, Soundeventdetectionviadilated convolutional recurrent neural networks, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 286–290

work page 2020

[9] [9]

Cakır, G

E. Cakır, G. Parascandolo, T. Heittola, H. Huttunen, T. Virtanen, Con- volutional recurrent neural networks for polyphonic sound event detec- tion, IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing 25 (6) (2017) 1291–1303

work page 2017

[10] [10]

Joseph, S

K. Joseph, S. Khan, F. S. Khan, V. N. Balasubramanian, Towards open world object detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5830–5840

work page 2021

[11] [11]

11444–11453

O.Zohar, K.-C.Wang, S.Yeung, Prob: Probabilisticobjectnessforopen world object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11444–11453

work page 2023

[12] [12]

S. Ma, Y. Wang, Y. Wei, J. Fan, T. H. Li, H. Liu, F. Lv, Cat: Local- ization and identification cascade detection transformer for open-world 28 object detection, in: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2023, pp. 19681–19690

work page 2023

[13] [13]

Gupta, S

A. Gupta, S. Narayan, K. Joseph, S. Khan, F. S. Khan, M. Shah, Ow-detr: Open-world detection transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 9235–9244

work page 2022

[14] [14]

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: De- formable transformers for end-to-end object detection, arXiv preprint arXiv:2010.04159 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[15] [15]

Salamon, D

J. Salamon, D. MacConnell, M. Cartwright, P. Li, J. P. Bello, Scaper: A library for soundscape synthesis and augmentation, in: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 344–348.doi:10.1109/WASPAA.2017.8170052

work page doi:10.1109/waspaa.2017.8170052 2017

[16] [16]

Turpault, R

N. Turpault, R. Serizel, A. P. Shah, J. Salamon, Sound event detection in domestic environments with weakly labeled data and soundscape syn- thesis, in: Workshop on Detection and Classification of Acoustic Scenes and Events, 2019

work page 2019

[17] [17]

Mesaros, T

A. Mesaros, T. Heittola, A. Eronen, T. Virtanen, Acoustic event detec- tion in real life recordings, in: 2010 18th European Signal Processing Conference, 2010, pp. 1267–1271

work page 2010

[18] [18]

Stowell, D

D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, M. D. Plumbley, Detection and classification of acoustic scenes and events, IEEE Trans- actions on Multimedia 17 (10) (2015) 1733–1746.doi:10.1109/TMM. 2015.2428998

work page doi:10.1109/tmm 2015

[19] [19]

K. J. Piczak, Environmental sound classification with convolutional neural networks, in: 2015 IEEE 25th International Workshop on Ma- chine Learning for Signal Processing (MLSP), 2015, pp. 1–6.doi: 10.1109/MLSP.2015.7324337

[20]

H. Nam, S.-H. Kim, B.-Y. Ko, Y.-H. Park, Frequency Dynamic Convolution: Frequency-Adaptive Pattern Recognition for Sound Event Detection, in: Proc. Interspeech 2022, 2022, pp. 2763–2767. doi:10.21437/Interspeech.2022-10127.

[21]

K. Li, Y. Song, L.-R. Dai, I. McLoughlin, X. Fang, L. Liu, Ast-sed: An effective sound event detection method based on audio spectrogram transformer, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5

[22]

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, R. Pang, Conformer: Convolution-augmented transformer for speech recognition, in: H. Meng, B. Xu, T. F. Zheng (Eds.), INTERSPEECH, ISCA, 2020, pp. 5036–5040

[23]

S. Barahona, D. de Benito-Gorrón, D. T. Toledano, D. Ramos, Enhancing conformer-based sound event detection using frequency dynamic convolutions and beats audio embeddings, IEEE/ACM Transactions on Audio, Speech, and Language Processing 32 (2024) 3896–3907. doi:10.1109/TASLP.2024.3444490

[24]

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European conference on computer vision, Springer, 2020, pp. 213–229

[25]

H. Yin, J. Chen, J. Bai, M. Wang, S. Rahardja, D. Shi, W.-S. Gan, Multi-granularity acoustic information fusion for sound event detection, Signal Processing 227 (2025) 109691. doi:10.1016/j.sigpro.2024.109691

[26]

J. You, W. Wu, J. Lee, Open set classification of sound event, Scientific Reports 14 (01 2024). doi:10.1038/s41598-023-50639-7

[27]

P. Cai, Y. Song, Q. Gu, N. Jiang, H. Song, I. McLoughlin, Detect any sound: Open-vocabulary sound event detection with multi-modal queries (2025). arXiv:2507.16343. URL https://arxiv.org/abs/2507.16343

[28]

J. Hai, H. Wang, W. Guo, M. Elhilali, Flexsed: Towards open-vocabulary sound event detection (2025). arXiv:2509.18606. URL https://arxiv.org/abs/2509.18606

[29]

Y. Xiao, R. K. Das, Ucil: An unsupervised class incremental learning approach for sound event detection (2025). arXiv:2407.03657. URL https://arxiv.org/abs/2407.03657

[30]

R. Pandey, M. Mulimani, A. Politis, A. Mesaros, Class-incremental learning for sound event localization and detection (2024). arXiv:2411.12830. URL https://arxiv.org/abs/2411.12830

[31]

N. Dong, Y. Zhang, M. Ding, G. H. Lee, Open world detr: Transformer based open world object detection, arXiv preprint arXiv:2212.02969 (2022)

[32]

D. Pershouse, F. Dayoub, D. Miller, N. Sünderhauf, Addressing the challenges of open-world object detection, arXiv preprint arXiv:2303.14930 (2023)

[33]

A. Bendale, T. Boult, Towards open set deep networks, in: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE, 2016

[34]

L. Shu, H. Xu, B. Liu, DOC: Deep open classification of text documents, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2911–2916. doi:10.18653/v1/D17-1314. URL https://aclanthology.org/D17-1314/

[35]

G. Zamzmi, T. Oguguo, S. Rajaraman, S. Antani, Open-world active learning for echocardiography view classification, in: Medical Imaging 2022: Computer-Aided Diagnosis, Vol. 12033, SPIE, 2022, pp. 138–148

[36]

L. Zheng, D. Liu, T. Wu, Y. Chen, Stwwgram-odcbam: Multimodal feature fusion and dynamic attention mechanism for anomalous sound detection, Signal Processing 239 (2026) 110218. doi:10.1016/j.sigpro.2025.110218. URL https://www.sciencedirect.com/science/article/pii/S0165168425003329

[37]

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

[38]

J. Chen, J. Hao, K. Chen, D. Xie, S. Yang, S. Pu, An end-to-end audio classification system based on raw waveforms and mix-training strategy, in: Interspeech 2019, 2019, pp. 3644–3648. doi:10.21437/Interspeech.2019-1579

[39]

S. S. Mullappilly, A. S. Gehlot, R. M. Anwer, F. S. Khan, H. Cholakkal, Semi-supervised open-world object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 4305–4314.
