HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods
Pith reviewed 2026-05-24 20:16 UTC · model grok-4.3
The pith
An ensemble of consistency-regularized and MixUp-trained CRNN models detects domestic sound events with 42.0% event-based F-measure, up from the 25.8% baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training an ensemble of CRNN models, each using one of three semi-supervised principles (consistency regularization with data augmentation, MixUp regularization, and MixUp on data augmentations) on a mix of weakly labeled, synthetic, and unlabeled domestic sound data, the HODGEPODGE approach achieves an event-based F-measure of 42.0% on the DCASE 2019 Task 4 evaluation dataset, compared to 25.8% for the baseline.
What carries the argument
Ensemble of CRNN models each trained with one of three semi-supervised principles (consistency regularization via data augmentation, MixUp regularizer, MixUp on augmentations) to use weakly labeled plus unlabeled data.
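The three principles can be read as three loss terms. The following is a minimal sketch of that reading, not the authors' implementation: the function names, the squared-error distance, the Beta(alpha, alpha) draw for the mixing coefficient, and the toy model and augmentation are all assumptions made for illustration, and a real system would apply these terms to CRNN outputs on audio features.

```python
import math
import random


def mix(a, b, lam):
    """Element-wise interpolation: lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mse(p, q):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(p, q)) / len(p)


def sample_lambda(alpha, rng):
    """Draw a mixing coefficient from Beta(alpha, alpha), a common MixUp choice."""
    return rng.betavariate(alpha, alpha)


def consistency_loss(model, x, augment):
    """Principle 1 (assumed form): prediction on x should match prediction on augment(x)."""
    return mse(model(x), model(augment(x)))


def mixup_loss(model, x1, x2, lam):
    """Principle 2: prediction on the mixed input should match the mixed predictions."""
    return mse(model(mix(x1, x2, lam)), mix(model(x1), model(x2), lam))


def mixup_on_aug_loss(model, x1, x2, lam, augment):
    """Principle 3 (assumed form): the MixUp constraint applied to augmented inputs."""
    return mixup_loss(model, augment(x1), augment(x2), lam)


# Hypothetical stand-ins for a CRNN and an augmentation, for illustration only.
def toy_model(x):
    return [1.0 / (1.0 + math.exp(-4.0 * v)) for v in x]


def toy_augment(x):
    return [v + 0.1 for v in x]


rng = random.Random(0)
lam = sample_lambda(0.5, rng)
x1, x2 = [1.0, 0.0], [0.0, 1.0]
print(consistency_loss(toy_model, x1, toy_augment))
print(mixup_loss(toy_model, x1, x2, lam))
print(mixup_on_aug_loss(toy_model, x1, x2, lam, toy_augment))
```

None of the three terms needs a label, which is why they can be computed on the unlabeled recordings. An affine model satisfies the MixUp constraint exactly, so that term penalizes only nonlinear behavior between training points.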
If this is right
- The approach makes fuller use of large volumes of unlabeled in-domain recordings than supervised training alone.
- Models trained under each of the three principles capture complementary signals that improve results when ensembled.
- The method is directly applicable to the DCASE 2019 Task 4 data setting of domestic environments with weak and synthetic labels.
- Event-based F-measure on the official evaluation set rises from 25.8% to 42.0% under this training regime.
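The abstract does not say how the models are combined. One simple possibility, shown here purely as an assumption, is to average frame-wise class probabilities across models and then threshold them. The function names, the equal weighting, and the 0.5 threshold are illustrative choices, not details from the paper.

```python
def ensemble_average(model_outputs):
    """Average frame-wise class probabilities across models.

    model_outputs: one entry per model, each a list of frames,
    each frame a list of per-class probabilities.
    """
    n_models = len(model_outputs)
    n_frames = len(model_outputs[0])
    n_classes = len(model_outputs[0][0])
    return [
        [sum(m[t][c] for m in model_outputs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


def binarize(probs, threshold=0.5):
    """Turn averaged probabilities into frame-wise on/off decisions."""
    return [[1 if p >= threshold else 0 for p in frame] for frame in probs]


# Two hypothetical models, two frames, two classes.
model_a = [[0.9, 0.2], [0.4, 0.6]]
model_b = [[0.7, 0.4], [0.2, 0.8]]
averaged = ensemble_average([model_a, model_b])
print(binarize(averaged))
```

Averaging helps only to the extent that the models make different errors, which is the complementarity the second bullet assumes.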
Where Pith is reading between the lines
- The same consistency and MixUp ensemble could be applied to other audio tasks that combine weak labels with abundant unlabeled recordings, such as environmental sound classification.
- Results may depend on the specific ratio of weakly labeled to unlabeled data present in the DCASE collection, so performance on differently balanced corpora would test generality.
- If the gains require extensive per-challenge tuning, the principles would need explicit robustness checks before use in new acoustic environments.
Load-bearing premise
The observed performance gain comes from the three semi-supervised principles and their ensemble rather than from challenge-specific hyperparameter tuning or fitting to evaluation artifacts.
What would settle it
Retraining the same ensemble architecture and principles on a new domestic audio collection with the same mix of weak, synthetic, and unlabeled data but a different test set, and measuring no improvement over a plain supervised CRNN, would falsify the central claim.
Original abstract
In this paper, we present a method called HODGEPODGE\footnotemark[1] for large-scale detection of sound events using weakly labeled, synthetic, and unlabeled data, as proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge Task 4: Sound event detection in domestic environments. To perform this task, we adopted convolutional recurrent neural networks (CRNN) as our backbone network. To deal with a small amount of tagged data and a large amount of unlabeled in-domain data, we focus primarily on how to apply semi-supervised learning methods efficiently to make full use of limited data. Three semi-supervised learning principles are used in our system: 1) consistency regularization applied to data augmentations; 2) a MixUp regularizer requiring that the prediction for an interpolation of two inputs be close to the interpolation of the predictions for each individual input; 3) MixUp regularization applied to interpolations between data augmentations. We also tried an ensemble of various models trained using different semi-supervised learning principles. Our proposed approach significantly improved on the baseline, achieving an event-based F-measure of 42.0% compared to 25.8% for the baseline on the provided official evaluation dataset. Our submissions ranked third among 18 teams in Task 4.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HODGEPODGE, an ensemble method for sound event detection on the DCASE 2019 Task 4 dataset (weakly labeled, synthetic, and unlabeled domestic audio). It uses a CRNN backbone trained with three semi-supervised principles—consistency regularization through data augmentation, MixUp regularization, and MixUp applied to augmented inputs—plus ensembling of models trained under different principles. The central empirical claim is an event-based F-measure of 42.0% on the official held-out evaluation set, versus 25.8% for the baseline, placing third among 18 teams.
Significance. If the gains are shown to arise from the three semi-supervised principles rather than undisclosed tuning or ensemble averaging, the work would supply concrete evidence that consistency regularization and MixUp can be combined effectively with CRNNs on mixed supervision regimes in audio. The ensemble strategy itself is a pragmatic contribution for challenge settings, but the absence of isolating experiments limits claims about which components drive the 16.2-point lift and whether they transfer beyond this fixed evaluation set.
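For readers unfamiliar with the metric, the sketch below shows how an event-based F-measure can be computed. It is a simplified illustration, not the official scoring code: the greedy matching, the pooled (micro) counting over all classes, and the collar values (200 ms on onsets, and the larger of 200 ms or 20% of the reference event length on offsets) are assumptions here and are not stated on this page.

```python
def event_based_f1(reference, estimated,
                   onset_collar=0.2, offset_collar=0.2, offset_ratio=0.2):
    """Event-based F1 over (label, onset, offset) tuples, times in seconds.

    An estimated event counts as a true positive when it matches an unused
    reference event with the same label, an onset within onset_collar, and
    an offset within max(offset_collar, offset_ratio * reference length).
    """
    used = set()
    tp = 0
    for label, onset, offset in estimated:
        for i, (ref_label, ref_onset, ref_offset) in enumerate(reference):
            if i in used or ref_label != label:
                continue
            offset_tol = max(offset_collar, offset_ratio * (ref_offset - ref_onset))
            if abs(onset - ref_onset) <= onset_collar and abs(offset - ref_offset) <= offset_tol:
                used.add(i)
                tp += 1
                break
    fp = len(estimated) - tp   # estimated events with no matching reference
    fn = len(reference) - tp   # reference events that were never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical annotations and detections for one clip.
reference = [("Speech", 1.0, 3.0), ("Dog", 5.0, 6.0)]
estimated = [("Speech", 1.1, 3.3), ("Dog", 5.5, 6.0), ("Cat", 0.0, 1.0)]
print(event_based_f1(reference, estimated))
```

In this example only the Speech event matches, because the Dog onset is 0.5 s late and Cat has no reference, so the score penalizes timing errors as heavily as missed events. That strictness is one reason event-based scores on this task are low in absolute terms.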
major comments (4)
- [experimental results / §4] The experimental results (abstract and §4) report only a single headline event-based F-measure of 42.0% with no ablation tables that isolate the contribution of consistency regularization, MixUp, MixUp-on-augmentations, or the ensemble. Without these, it is impossible to attribute the improvement over the 25.8% baseline to the three stated semi-supervised principles rather than other factors.
- [methods] The methods section provides no description of how hyperparameters were chosen (augmentation strength, MixUp alpha, consistency loss weight, learning-rate schedule, or ensemble weighting). This omission is load-bearing because the skeptic concern—that the lift may result from challenge-specific tuning rather than the principles themselves—cannot be evaluated from the given information.
- [experimental setup] It is not stated whether the 25.8% baseline is the official DCASE 2019 Task 4 baseline or a re-implementation using the same CRNN architecture and data splits. The relative improvement cannot be assessed without this clarification and without reporting the baseline's own training details.
- [results] Results are given as single point estimates with neither error bars, multiple random seeds, nor cross-validation across data splits. A 16.2-point gain on a fixed external set is therefore not shown to be robust, undermining the claim that the semi-supervised ensemble reliably improves performance.
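The robustness check requested in the last comment can be sketched as follows. This is an illustrative outline only: the function names, the two-standard-deviation margin, and the per-seed scores are hypothetical placeholders, not measurements from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def gain_is_robust(scores, baseline, k=2.0):
    """True when the mean stays above the baseline even after subtracting k * std."""
    mean, std = summarize_runs(scores)
    return mean - k * std > baseline


# Placeholder per-seed F-measures (percent); NOT results reported by the authors.
placeholder_scores = [40.0, 42.0, 44.0]
mean, std = summarize_runs(placeholder_scores)
print(f"{mean:.1f} +/- {std:.1f}")
print(gain_is_robust(placeholder_scores, baseline=25.8))
```

With only a handful of seeds the standard deviation is itself a rough estimate, so a check like this indicates stability rather than proving it.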
minor comments (2)
- [abstract] The footnote referenced in the abstract for 'HODGEPODGE' is not reproduced or explained in the provided text.
- [methods] Notation for the three semi-supervised losses and their combination weights is introduced without explicit equations or a summary table, making the implementation details harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate revisions planned for the next manuscript version.
-
Referee: [experimental results / §4] The experimental results (abstract and §4) report only a single headline event-based F-measure of 42.0% with no ablation tables that isolate the contribution of consistency regularization, MixUp, MixUp-on-augmentations, or the ensemble. Without these, it is impossible to attribute the improvement over the 25.8% baseline to the three stated semi-supervised principles rather than other factors.
Authors: We agree that ablation studies are needed to isolate contributions. The reported result uses the full ensemble of all three principles. In revision we will add incremental ablation experiments applying each principle to the baseline CRNN.
Revision: yes
-
Referee: [methods] The methods section provides no description of how hyperparameters were chosen (augmentation strength, MixUp alpha, consistency loss weight, learning-rate schedule, or ensemble weighting). This omission is load-bearing because the skeptic concern—that the lift may result from challenge-specific tuning rather than the principles themselves—cannot be evaluated from the given information.
Authors: We acknowledge the missing hyperparameter details. The revised manuscript will specify the values for augmentation strength, MixUp alpha, consistency loss weight, learning-rate schedule, and ensemble weighting, together with the validation-based selection procedure.
Revision: yes
-
Referee: [experimental setup] It is not stated whether the 25.8% baseline is the official DCASE 2019 Task 4 baseline or a re-implementation using the same CRNN architecture and data splits. The relative improvement cannot be assessed without this clarification and without reporting the baseline's own training details.
Authors: The 25.8% figure is the official DCASE 2019 Task 4 baseline. Our CRNN follows the challenge architecture and data splits. We will state this explicitly and supply baseline training details in the revision.
Revision: yes
-
Referee: [results] Results are given as single point estimates with neither error bars, multiple random seeds, nor cross-validation across data splits. A 16.2-point gain on a fixed external set is therefore not shown to be robust, undermining the claim that the semi-supervised ensemble reliably improves performance.
Authors: Single-point estimates follow standard DCASE challenge practice on the fixed official test set. To demonstrate robustness we will add results from multiple random seeds with error bars in the revised manuscript.
Revision: yes
Circularity Check
No circularity: empirical performance on external challenge data
The paper describes an ensemble of CRNN models trained with consistency regularization, MixUp, and MixUp-on-augmentations on DCASE 2019 Task 4 data, then reports measured event-based F-measure (42.0% vs 25.8% baseline) on the official evaluation set. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. The headline result is an external measurement, not a quantity derived from the paper's own inputs by construction.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Three semi-supervised learning principles... Consistency regularization applies data augmentation; MixUp regularizer... MixUp regularization applies to interpolation between data augmentations... ensemble of various models
-
IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
CRNN backbone... event-based f-measure of 42.0%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.