HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods

Anyan Shi; Huibin Lin; Liu Liu; Rujie Liu; Ziqiang Shi

arxiv: 1907.07398 · v1 · pith:VHJFU2W4new · submitted 2019-07-17 · 💻 cs.SD · eess.AS

Pith reviewed 2026-05-24 20:16 UTC · model grok-4.3

keywords: sound event detection · semi-supervised learning · CRNN · ensemble methods · consistency regularization · MixUp · weakly labeled data · DCASE 2019

The pith

An ensemble of consistency-regularized and MixUp-trained CRNN models detects domestic sound events with 42.0% event-based F-measure, up from the 25.8% baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that three semi-supervised learning principles can be combined with a convolutional recurrent neural network to better exploit weakly labeled, synthetic, and unlabeled audio data for sound event detection. The principles are consistency regularization through data augmentation, a MixUp regularizer that requires predictions on interpolated inputs to match interpolated predictions, and MixUp regularization applied to data augmentations. Separate models are trained under each principle and then ensembled. This yields a large improvement over the baseline on the official DCASE 2019 Task 4 evaluation set. A sympathetic reader would care because most real-world sound detection problems have far more unlabeled recordings than carefully annotated ones, so methods that leverage the unlabeled data can reduce the cost of building effective systems.
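The MixUp principle described above can be illustrated with a minimal sketch. It assumes a toy one-unit model in place of the CRNN, a squared-error penalty, and a Beta(0.4, 0.4) mixing coefficient; `toy_model`, `mix`, `mixup_consistency_loss`, and the value 0.4 are all hypothetical and are not taken from the paper.

```python
import math
import random

def mix(a, b, lam):
    """Element-wise convex combination lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def toy_model(x, weights=(0.8, -0.5, 0.3)):
    """Hypothetical stand-in for the CRNN: one sigmoid unit over a 3-dim input."""
    z = sum(w * v for w, v in zip(weights, x))
    return [1.0 / (1.0 + math.exp(-z))]

def mixup_consistency_loss(model, x1, x2, lam):
    """Mean squared gap between f(mix(x1, x2)) and mix(f(x1), f(x2))."""
    pred_on_mix = model(mix(x1, x2, lam))
    mix_of_preds = mix(model(x1), model(x2), lam)
    gaps = [(p - q) ** 2 for p, q in zip(pred_on_mix, mix_of_preds)]
    return sum(gaps) / len(gaps)

rng = random.Random(0)
lam = rng.betavariate(0.4, 0.4)  # assumed Beta(alpha, alpha) with alpha = 0.4
x1, x2 = [1.0, 0.0, 2.0], [0.0, 3.0, -1.0]  # two unlabeled inputs (made up)
loss = mixup_consistency_loss(toy_model, x1, x2, lam)
print(f"lam={lam:.3f} loss={loss:.6f}")
```

Because the penalty needs no labels, a term of this shape can in principle be computed on the unlabeled recordings as well as the labeled ones.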

Core claim

By training an ensemble of CRNN models, each using one of three semi-supervised principles (consistency regularization with data augmentation, MixUp regularization, and MixUp on data augmentations) on a mix of weakly labeled, synthetic, and unlabeled domestic sound data, the HODGEPODGE approach achieves an event-based F-measure of 42.0% on the DCASE 2019 Task 4 evaluation dataset, compared to 25.8% for the baseline.
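For readers unfamiliar with the metric, the sketch below shows how an F-measure follows from event counts. The step that decides whether a detected event matches a reference event (the event-based part of the metric) is omitted, and the counts are illustrative only, not figures from the paper.

```python
def f_measure(tp, fp, fn):
    """F-measure from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Illustrative counts only; they are not taken from the paper.
score = f_measure(tp=6, fp=2, fn=2)
print(f"{score:.3f}")  # 0.750
```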

What carries the argument

Ensemble of CRNN models each trained with one of three semi-supervised principles (consistency regularization via data augmentation, MixUp regularizer, MixUp on augmentations) to use weakly labeled plus unlabeled data.
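A rough sketch of consistency regularization via data augmentation follows. It assumes additive Gaussian noise as the augmentation and a squared-error penalty between two augmented views; the paper's actual augmentations and loss are not given in the text above, so `augment`, `toy_model`, and `noise_std` are hypothetical.

```python
import math
import random

def augment(x, rng, noise_std=0.1):
    """Hypothetical augmentation: add small Gaussian noise to each feature."""
    return [v + rng.gauss(0.0, noise_std) for v in x]

def toy_model(x, weights=(0.8, -0.5, 0.3)):
    """Hypothetical stand-in for the CRNN: one sigmoid unit."""
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def consistency_loss(model, x, rng, noise_std=0.1):
    """Squared difference between predictions on two augmented views of x."""
    view_a = model(augment(x, rng, noise_std))
    view_b = model(augment(x, rng, noise_std))
    return (view_a - view_b) ** 2

rng = random.Random(0)
unlabeled_batch = [[1.0, 0.0, 2.0], [0.0, 3.0, -1.0]]  # made-up inputs
loss = sum(consistency_loss(toy_model, x, rng) for x in unlabeled_batch)
loss /= len(unlabeled_batch)
print(f"consistency loss: {loss:.6f}")
```

The penalty shrinks to zero when the model gives the same prediction for every augmented view of an input, which is the invariance this principle asks for.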

If this is right

The approach makes fuller use of large volumes of unlabeled in-domain recordings than supervised training alone.
Models trained under each of the three principles capture complementary signals that improve results when ensembled.
The method is directly applicable to the DCASE 2019 Task 4 data setting of domestic environments with weak and synthetic labels.
Event-based F-measure on the official evaluation set rises from 25.8% to 42.0% under this training regime.
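The ensembling idea in the points above can be sketched as equal-weight averaging of frame-wise probabilities followed by a threshold. Equal weights, the 0.5 threshold, and the probability values are assumptions made for illustration; the text above does not specify the paper's combination rule.

```python
def ensemble_average(per_model_probs):
    """Average frame-wise probabilities across models (equal weights assumed)."""
    n_models = len(per_model_probs)
    return [sum(frame) / n_models for frame in zip(*per_model_probs)]

def binarize(probs, threshold=0.5):
    """Turn averaged probabilities into per-frame on/off decisions."""
    return [1 if p >= threshold else 0 for p in probs]

# Made-up frame probabilities for one event class from three hypothetical models.
probs_consistency = [0.9, 0.6, 0.2, 0.1]
probs_mixup = [0.8, 0.4, 0.3, 0.2]
probs_mixup_aug = [0.7, 0.6, 0.1, 0.3]

avg = ensemble_average([probs_consistency, probs_mixup, probs_mixup_aug])
decisions = binarize(avg)
print(decisions)  # [1, 1, 0, 0]
```

In the second frame the MixUp model alone would miss the event (0.4), but the average (about 0.53) crosses the threshold, which is the kind of complementarity an ensemble relies on.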

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency and MixUp ensemble could be applied to other audio tasks that combine weak labels with abundant unlabeled recordings, such as environmental sound classification.
Results may depend on the specific ratio of weakly labeled to unlabeled data present in the DCASE collection, so performance on differently balanced corpora would test generality.
If the gains require extensive per-challenge tuning, the principles would need explicit robustness checks before use in new acoustic environments.

Load-bearing premise

The observed performance gain comes from the three semi-supervised principles and their ensemble rather than from challenge-specific hyperparameter tuning or fitting to evaluation artifacts.

What would settle it

Retraining the same ensemble architecture and principles on a new domestic audio collection with the same mix of weak, synthetic, and unlabeled data but a different test set, and measuring no improvement over a plain supervised CRNN, would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.07398 by Anyan Shi, Huibin Lin, Liu Liu, Rujie Liu, Ziqiang Shi.

**Figure 2.** Architecture of a GLU. Similar to LSTMs, GLUs play the role of controlling the information passed on in the hierarchy. This special gating mechanism allows us to effectively capture long-range context dependencies by deepening layers without encountering the problem of vanishing gradient. For the seven gated convolutional layers, the kernel sizes are 3, the paddings are 1, the strides are 1, and the number…
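The gating mechanism in the caption can be sketched as a linear convolution multiplied element-wise by a sigmoid-gated convolution of the same input. The version below is a 1-D simplification with made-up weights (the function names and values are hypothetical); it uses the caption's kernel size 3, padding 1, and stride 1, which keep the output the same length as the input.

```python
import math

def conv1d(x, kernel, padding=1):
    """1-D cross-correlation with stride 1 and zero padding."""
    padded = [0.0] * padding + list(x) + [0.0] * padding
    k = len(kernel)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(padded) - k + 1)]

def gated_conv1d(x, w_linear, w_gate):
    """GLU: the linear path is scaled element-wise by a sigmoid gate."""
    linear = conv1d(x, w_linear)
    gate = [1.0 / (1.0 + math.exp(-g)) for g in conv1d(x, w_gate)]
    return [a * s for a, s in zip(linear, gate)]

x = [0.5, -1.0, 2.0, 0.0, 1.5]  # made-up input sequence
y = gated_conv1d(x, w_linear=[0.2, 0.5, -0.1], w_gate=[1.0, 0.0, -1.0])
print(len(x), len(y))  # 5 5
```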

Original abstract

In this paper, we present a method called HODGEPODGE¹ for large-scale detection of sound events using weakly labeled, synthetic, and unlabeled data, as proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge Task 4: Sound event detection in domestic environments. To perform this task, we adopted convolutional recurrent neural networks (CRNN) as our backbone network. To deal with a small amount of tagged data and a large amount of unlabeled in-domain data, we focus primarily on how to apply semi-supervised learning methods efficiently to make full use of limited data. Three semi-supervised learning principles have been used in our system: 1) consistency regularization applied through data augmentation; 2) a MixUp regularizer requiring that the prediction for an interpolation of two inputs is close to the interpolation of the predictions for each individual input; 3) MixUp regularization applied to interpolation between data augmentations. We also tried an ensemble of various models, which are trained using different semi-supervised learning principles. Our proposed approach significantly improved on the performance of the baseline, achieving an event-based F-measure of 42.0% compared to the 25.8% event-based F-measure of the baseline on the provided official evaluation dataset. Our submissions ranked third among 18 teams in Task 4.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a practical 16-point lift on DCASE 2019 Task 4 by ensembling three standard semi-supervised tricks on a CRNN, but without ablations the source of the gain stays unclear.

The headline result is a jump from 25.8% to 42.0% event-based F-measure on the official DCASE 2019 Task 4 evaluation set by taking consistency regularization via augmentation, MixUp, and MixUp on augmentations, training separate CRNN models with each, and averaging them. That combination is not new in principle, but the authors show it can be stacked and ensembled for this specific mix of weakly labeled, synthetic, and unlabeled domestic audio data, and they place third out of 18 teams. The work is useful as an engineering recipe for people already running similar challenges.

Referee Report

4 major / 2 minor

Summary. The paper presents HODGEPODGE, an ensemble method for sound event detection on the DCASE 2019 Task 4 dataset (weakly labeled, synthetic, and unlabeled domestic audio). It uses a CRNN backbone trained with three semi-supervised principles—consistency regularization through data augmentation, MixUp regularization, and MixUp applied to augmented inputs—plus ensembling of models trained under different principles. The central empirical claim is an event-based F-measure of 42.0% on the official held-out evaluation set, versus 25.8% for the baseline, placing third among 18 teams.

Significance. If the gains are shown to arise from the three semi-supervised principles rather than undisclosed tuning or ensemble averaging, the work would supply concrete evidence that consistency regularization and MixUp can be combined effectively with CRNNs on mixed supervision regimes in audio. The ensemble strategy itself is a pragmatic contribution for challenge settings, but the absence of isolating experiments limits claims about which components drive the 16.2-point lift and whether they transfer beyond this fixed evaluation set.
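The size of the lift quoted above follows directly from the two reported scores. The short calculation below also expresses it relative to the baseline; the relative figure is derived here and is not stated in the paper.

```python
baseline_f = 25.8   # event-based F-measure of the baseline, in percent
ensemble_f = 42.0   # event-based F-measure of the ensemble, in percent

absolute_gain = ensemble_f - baseline_f              # percentage points
relative_gain = 100.0 * absolute_gain / baseline_f   # percent of the baseline score

print(f"absolute: {absolute_gain:.1f} points")  # absolute: 16.2 points
print(f"relative: {relative_gain:.1f} %")       # relative: 62.8 %
```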

major comments (4)

[experimental results / §4] The experimental results (abstract and §4) report only a single headline event-based F-measure of 42.0% with no ablation tables that isolate the contribution of consistency regularization, MixUp, MixUp-on-augmentations, or the ensemble. Without these, it is impossible to attribute the improvement over the 25.8% baseline to the three stated semi-supervised principles rather than other factors.
[methods] The methods section provides no description of how hyperparameters were chosen (augmentation strength, MixUp alpha, consistency loss weight, learning-rate schedule, or ensemble weighting). This omission is load-bearing because the skeptic concern—that the lift may result from challenge-specific tuning rather than the principles themselves—cannot be evaluated from the given information.
[experimental setup] It is not stated whether the 25.8% baseline is the official DCASE 2019 Task 4 baseline or a re-implementation using the same CRNN architecture and data splits. The relative improvement cannot be assessed without this clarification and without reporting the baseline's own training details.
[results] Results are given as single point estimates with neither error bars, multiple random seeds, nor cross-validation across data splits. A 16.2-point gain on a fixed external set is therefore not shown to be robust, undermining the claim that the semi-supervised ensemble reliably improves performance.
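One way to organize the isolating experiments requested above is a full on/off grid over the three principles. The sketch below only enumerates the configurations; the principle names are labels chosen here, and combining several principles in a single model is a hypothetical design rather than something the paper describes.

```python
from itertools import product

PRINCIPLES = ("consistency", "mixup", "mixup_on_augmentations")

def ablation_grid():
    """Enumerate every on/off combination of the three principles."""
    return [dict(zip(PRINCIPLES, flags))
            for flags in product((False, True), repeat=len(PRINCIPLES))]

grid = ablation_grid()
for cfg in grid:
    enabled = [name for name, on in cfg.items() if on] or ["supervised only"]
    print(" + ".join(enabled))
```

The grid has eight rows: the supervised-only model, three single-principle models, three pairs, and the full combination. Scoring the ensemble separately from each row would then show how much of the gain comes from averaging alone.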

minor comments (2)

[abstract] The footnote referenced in the abstract for 'HODGEPODGE' is not reproduced or explained in the provided text.
[methods] Notation for the three semi-supervised losses and their combination weights is introduced without explicit equations or a summary table, making the implementation details harder to follow.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate revisions planned for the next manuscript version.

Referee: [experimental results / §4] The experimental results (abstract and §4) report only a single headline event-based F-measure of 42.0% with no ablation tables that isolate the contribution of consistency regularization, MixUp, MixUp-on-augmentations, or the ensemble. Without these, it is impossible to attribute the improvement over the 25.8% baseline to the three stated semi-supervised principles rather than other factors.

Authors: We agree that ablation studies are needed to isolate contributions. The reported result uses the full ensemble of all three principles. In revision we will add incremental ablation experiments applying each principle to the baseline CRNN. revision: yes
Referee: [methods] The methods section provides no description of how hyperparameters were chosen (augmentation strength, MixUp alpha, consistency loss weight, learning-rate schedule, or ensemble weighting). This omission is load-bearing because the skeptic concern—that the lift may result from challenge-specific tuning rather than the principles themselves—cannot be evaluated from the given information.

Authors: We acknowledge the missing hyperparameter details. The revised manuscript will specify the values for augmentation strength, MixUp alpha, consistency loss weight, learning-rate schedule and ensemble weighting, together with the validation-based selection procedure. revision: yes
Referee: [experimental setup] It is not stated whether the 25.8% baseline is the official DCASE 2019 Task 4 baseline or a re-implementation using the same CRNN architecture and data splits. The relative improvement cannot be assessed without this clarification and without reporting the baseline's own training details.

Authors: The 25.8% figure is the official DCASE 2019 Task 4 baseline. Our CRNN follows the challenge architecture and data splits. We will state this explicitly and supply baseline training details in the revision. revision: yes
Referee: [results] Results are given as single point estimates with neither error bars, multiple random seeds, nor cross-validation across data splits. A 16.2-point gain on a fixed external set is therefore not shown to be robust, undermining the claim that the semi-supervised ensemble reliably improves performance.

Authors: Single-point estimates follow standard DCASE challenge practice on the fixed official test set. To demonstrate robustness we will add results from multiple random seeds with error bars in the revised manuscript. revision: yes
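Reporting results over several random seeds, as promised in the last response, usually means giving a mean and a spread. The sketch below shows that summary with Python's `statistics` module; the scores are arbitrary placeholders for three hypothetical seeds and are not results from the paper.

```python
import statistics

def summarize(scores):
    """Return the mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Arbitrary placeholder values for three hypothetical seeds; NOT paper results.
placeholder_scores = [0.10, 0.20, 0.30]
mean, std = summarize(placeholder_scores)
print(f"{mean:.2f} ± {std:.2f}")
```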

Circularity Check

0 steps flagged

No circularity: empirical performance on external challenge data

The paper describes an ensemble of CRNN models trained with consistency regularization, MixUp, and MixUp-on-augmentations on DCASE 2019 Task 4 data, then reports measured event-based F-measure (42.0% vs 25.8% baseline) on the official evaluation set. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. The headline result is an external measurement, not a quantity derived from the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The work rests on the standard assumption that CRNNs are appropriate for audio sequences and that the DCASE evaluation protocol is a faithful measure of real-world performance.

pith-pipeline@v0.9.0 · 5798 in / 1153 out tokens · 23035 ms · 2026-05-24T20:16:32.510380+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Three semi-supervised learning principles... Consistency regularization applies data augmentation; MixUp regularizer... MixUp regularization applies to interpolation between data augmentations... ensemble of various models
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

CRNN backbone... event-based f-measure of 42.0%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, “Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 2, pp. 379–393, 2018.

[2] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events, 2017.

[3] http://dcase.community/challenge2019/

[4] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, “Large-scale weakly labeled semi-supervised sound event detection in domestic environments,” arXiv preprint arXiv:1807.10501, 2018.

[5] L. JiaKai, “Mean teacher convolution system for DCASE 2018 task 4,” DCASE2018 Challenge, Tech. Rep., September 2018.

[6] Q. Kong, T. Iqbal, Y. Xu, W. Wang, and M. D. Plumbley, “DCASE 2018 challenge baseline with convolutional neural networks,” arXiv preprint arXiv:1808.00773, 2018.

[7] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in Neural Information Processing Systems, 2017, pp. 1195–1204.

[8] V. Verma, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz, “Interpolation consistency training for semi-supervised learning,” arXiv preprint arXiv:1903.03825, 2019.

[9] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, “MixMatch: A holistic approach to semi-supervised learning,” arXiv preprint arXiv:1905.02249, 2019.

[10] N. Turpault, R. Serizel, A. P. Shah, and J. Salamon, “Sound event detection in domestic environments with weakly labeled data and soundscape synthesis,” 2019.

[11] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.

[12] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.

[13] E. Fonseca, J. Pons Puig, X. Favory, F. Font Corbera, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra, “Freesound datasets: a platform for the creation of open audio datasets,” in Hu X, Cunningham SJ, Turnbull D, Duan Z, editors. Proceedings of the 18th ISMIR Conference; 2017 Oct 23-27; Suzhou, China. [Canada]: International Society for Musi...

[14] G. Dekkers, S. Lauwereins, B. Thoen, M. W. Adhana, H. Brouckxon, T. van Waterschoot, B. Vanrumste, M. Verhelst, and P. Karsmakers, “The SINS database for detection of daily activities in a home environment using an acoustic sensor network,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Mun...

[15] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016.
