Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification
Pith reviewed 2026-05-08 04:31 UTC · model grok-4.3
The pith
Training base models on four distinct data-split and granularity regimes lets a meta-learner reach a Score of 66.49 percent on the ICBHI respiratory sound classification benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training base models on the ICBHI dataset using a fixed 80-20 percent split and five-fold cross-validation, each at both patient-level and sample-level granularity, then combining their outputs through a trained meta-model, the approach reaches a Score of 66.49 percent on the benchmark and improves generalization on two out-of-distribution datasets.
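As a rough illustration of the four regimes named above, the following sketch builds fixed 80-20 and five-fold partitions at both sample level and patient level. It is an assumption-laden toy, not the authors' pipeline: the helper names (`fixed_split`, `kfold_split`, `make_regimes`), the seed handling, and the synthetic patient and recording IDs are all hypothetical, and the source does not say how its splits were actually drawn.

```python
import random
from collections import defaultdict

def fixed_split(units, train_frac=0.8, seed=0):
    """Shuffle units once and cut them into a single (train, test) pair."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return [(shuffled[:cut], shuffled[cut:])]

def kfold_split(units, k=5, seed=0):
    """Shuffle units and return k (train, held-out) pairs."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([u for j, f in enumerate(folds) if j != i for u in f], folds[i])
        for i in range(k)
    ]

def make_regimes(samples, seed=0):
    """samples: list of (sample_id, patient_id) pairs.

    Returns a dict mapping a regime name to a list of
    (train_sample_ids, test_sample_ids) pairs.
    """
    by_patient = defaultdict(list)
    for sample_id, patient_id in samples:
        by_patient[patient_id].append(sample_id)
    sample_ids = [s for s, _ in samples]
    patients = sorted(by_patient)

    def expand(patient_group):
        # Patient-level regimes split patients, then expand to recordings.
        return [s for p in patient_group for s in by_patient[p]]

    return {
        "sample/fixed": fixed_split(sample_ids, seed=seed),
        "sample/5fold": kfold_split(sample_ids, seed=seed),
        "patient/fixed": [(expand(tr), expand(te))
                          for tr, te in fixed_split(patients, seed=seed)],
        "patient/5fold": [(expand(tr), expand(te))
                          for tr, te in kfold_split(patients, seed=seed)],
    }

# Toy data (hypothetical): 10 patients with 4 recordings each.
samples = [(f"p{p}_s{s}", f"p{p}") for p in range(10) for s in range(4)]
regimes = make_regimes(samples)
```

In the patient-level regimes no patient appears on both sides of a split, whereas sample-level splits will typically place recordings from the same patient in both train and test. That difference is one plausible source of the prediction diversity the claim relies on.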
What carries the argument
The meta-model that learns to fuse predictions from the four base models trained under different split and granularity regimes.
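A minimal stacking sketch may help make the fusion step concrete. The source does not describe the meta-model's architecture or inputs, so everything here is an assumption: the meta-features are simply the concatenated class-probability vectors of the base models, the meta-learner is a plain multinomial logistic regression trained by gradient descent, and the toy data (one informative base model, three uninformative ones) is synthetic.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def meta_features(base_probs):
    """Concatenate per-model class-probability vectors into one vector."""
    return [p for probs in base_probs for p in probs]

def class_scores(W, b, x):
    return [sum(w * xi for w, xi in zip(Wc, x)) + bc for Wc, bc in zip(W, b)]

def fit_meta(X, y, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic regression, full-batch gradient descent."""
    d, n = len(X[0]), len(X)
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * d for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, t in zip(X, y):
            probs = softmax(class_scores(W, b, x))
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == t else 0.0)
                gb[c] += err
                for j in range(d):
                    gW[c][j] += err * x[j]
        for c in range(n_classes):
            b[c] -= lr * gb[c] / n
            for j in range(d):
                W[c][j] -= lr * gW[c][j] / n
    return W, b

def predict_meta(W, b, x):
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)

def toy_base_outputs(label, rng):
    """Hypothetical outputs of four base models for one binary sample."""
    def vec(cls):
        return [0.8, 0.2] if cls == 0 else [0.2, 0.8]
    informative = vec(label)                        # follows the true label
    noise = [vec(rng.randrange(2)) for _ in range(3)]  # ignores the label
    return [informative] + noise

rng = random.Random(0)
train_y = [rng.randrange(2) for _ in range(200)]
train_X = [meta_features(toy_base_outputs(t, rng)) for t in train_y]
test_y = [rng.randrange(2) for _ in range(100)]
test_X = [meta_features(toy_base_outputs(t, rng)) for t in test_y]

W, b = fit_meta(train_X, train_y, n_classes=2)
accuracy = sum(predict_meta(W, b, x) == t
               for x, t in zip(test_X, test_y)) / len(test_y)
```

The point of the toy is only that a learned fusion rule can weight base models unequally. Whether the paper's meta-model behaves this way cannot be determined from the text reviewed here.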
If this is right
- Ensemble effectiveness increases because base-model predictions become less correlated, and this is achieved without collecting any new data.
- The model maintains higher accuracy when test patients differ from those seen in training.
- The method requires no extra labeled recordings beyond the original ICBHI collection.
Where Pith is reading between the lines
- The same four-regime split tactic could be tested on other small-scale medical audio or signal tasks.
- If the meta-model adds little beyond simple averaging, a non-learned fusion rule might suffice and reduce complexity.
- Adding further split variants or different granularities could be checked to see whether gains continue or plateau.
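The second point above compares a learned meta-model with non-learned fusion. As a hedged sketch of what such baselines could look like, the snippet below implements probability averaging and majority voting over base-model outputs. The function names and the four example probability vectors are illustrative assumptions, not values from the paper.

```python
from collections import Counter

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def average_fusion(base_probs):
    """Average class probabilities across models, then take the argmax."""
    n_models = len(base_probs)
    n_classes = len(base_probs[0])
    mean = [sum(p[c] for p in base_probs) / n_models for c in range(n_classes)]
    return argmax(mean), mean

def majority_vote(base_probs):
    """Each model votes for its own argmax class; the most common vote wins."""
    votes = [argmax(p) for p in base_probs]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical outputs of four base models for one binary sample.
base_probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55], [0.3, 0.7]]
avg_class, avg_probs = average_fusion(base_probs)
vote_class = majority_vote(base_probs)
```

In this example the two rules disagree: one confident model outweighs three lukewarm ones under averaging, while voting follows the three. A trained meta-model would need to beat both baselines on held-out patients to justify its added complexity.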
Load-bearing premise
The four chosen combinations of split strategy and granularity produce enough independent error patterns for the meta-model to discover a stable weighting rule rather than fitting noise from the training subjects.
What would settle it
A new respiratory-sound dataset on which all four base-model families produce highly correlated errors on held-out patients would show whether the induced diversity is sufficient.
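One way to run the check described above is to compare the error patterns of the base models pairwise on held-out patients. The sketch below computes the Pearson correlation between binary error indicators. The model names and predictions are made-up toy values, and other diversity measures (for example disagreement rate or the Q-statistic) could be substituted.

```python
import math
from itertools import combinations

def error_indicator(preds, labels):
    """1 where the model is wrong, 0 where it is right."""
    return [int(p != t) for p, t in zip(preds, labels)]

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when a model makes no (or only) errors
    return cov / math.sqrt(var_a * var_b)

def pairwise_error_correlation(model_preds, labels):
    """model_preds: dict of model name -> list of predicted labels."""
    errs = {name: error_indicator(p, labels) for name, p in model_preds.items()}
    return {(a, b): pearson(errs[a], errs[b])
            for a, b in combinations(sorted(errs), 2)}

# Toy held-out set (hypothetical): models A and B fail on the same two
# samples, model C fails on two different ones.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
model_preds = {
    "A": [1, 0, 0, 1, 0, 1, 0, 1],
    "B": [1, 0, 0, 1, 0, 1, 0, 1],
    "C": [0, 1, 1, 0, 0, 1, 0, 1],
}
corr = pairwise_error_correlation(model_preds, labels)
```

Values near 1 for every pair would indicate that the split regimes did not induce meaningfully different errors, which is the failure case the paragraph above describes.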
Original abstract
Training reliable respiratory sound classification models remains challenging due to the limited size and subject diversity of datasets. Ensemble methods can improve robustness, but when base models are trained on identical data, models tend to overfit and produce highly correlated predictions, thereby reducing the effectiveness of ensembling. In this work, we investigate a meta-ensemble learning methodology that enhances prediction diversity by training base models on diverse data splits and combining their outputs through a trained meta-model. Specifically, we train base models on the ICBHI dataset using two data split settings: fixed 80-20% split and five-fold cross-validation split, under two data granularity settings: patient- and sample-level. The resulting diversity in base model predictions enables the meta-model to better generalize. Our approach achieves new state-of-the-art performance on the ICBHI benchmark, reaching a Score of 66.49% and showing improved generalization on two out-of-distribution datasets, indicating its potential applicability to real-world clinical data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a meta-ensemble approach for respiratory sound classification on the ICBHI dataset. Base models are trained under four combinations of data splits (fixed 80-20% vs. 5-fold cross-validation) and granularities (patient-level vs. sample-level) to induce prediction diversity; their outputs are then combined by a trained meta-model. The central claims are a new state-of-the-art Score of 66.49% on ICBHI together with improved generalization on two out-of-distribution datasets.
Significance. If the performance gains are shown to arise from the split-induced diversity rather than dataset-specific correlations, the method would provide a low-overhead route to more robust ensembles in small medical audio datasets. The explicit OOD evaluation is a positive feature for clinical relevance. The work is purely empirical and does not introduce new theoretical machinery.
major comments (3)
- Abstract: The claim of a new state-of-the-art Score of 66.49% is presented without any numerical values for prior SOTA methods, for the four individual base models, or for a standard ensemble baseline. This omission makes it impossible to determine the size or source of the reported improvement and is load-bearing for the central performance claim.
- Abstract and results sections: No ablation is reported that isolates the contribution of each split/granularity combination to the meta-model's OOD performance, nor are statistical significance tests (e.g., paired t-tests or McNemar) supplied for the claimed gains. Without these, the assertion that the four variants produce sufficiently diverse predictions for a generalizable fusion rule cannot be evaluated.
- Abstract: The meta-model architecture, training procedure, and input feature construction are not described. Because the meta-learner is the component that must discover a transferable combination rule, the absence of these details prevents assessment of whether the reported OOD improvement is reproducible or merely an artifact of ICBHI-specific error correlations.
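For the significance testing requested in the second major comment, an exact McNemar test on paired predictions is one option. The sketch below is a generic stdlib implementation using the two-sided binomial form. The function name and the toy predictions are assumptions for illustration and do not come from the manuscript.

```python
from math import comb

def mcnemar_exact(preds_a, preds_b, labels):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts samples only model A gets
    right and c counts samples only model B gets right.
    """
    b = sum(1 for pa, pb, t in zip(preds_a, preds_b, labels)
            if pa == t and pb != t)
    c = sum(1 for pa, pb, t in zip(preds_a, preds_b, labels)
            if pa != t and pb == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    k = min(b, c)
    # Under H0 the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy example (hypothetical): 12 samples, all with true label 1.
labels = [1] * 12
preds_a = [1] * 8 + [0] + [1] * 3   # A is wrong once
preds_b = [0] * 8 + [1] + [1] * 3   # B is wrong eight times
b, c, p_value = mcnemar_exact(preds_a, preds_b, labels)
```

Only the discordant pairs enter the test, so it is well suited to comparing a meta-ensemble against a baseline evaluated on the same test recordings.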
minor comments (1)
- The term 'Score' is used without an explicit definition or reference to the ICBHI evaluation protocol; a one-sentence clarification would improve readability.
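For orientation, the ICBHI Score is commonly computed as the mean of sensitivity and specificity over the four classes normal, crackle, wheeze, and both. The sketch below follows that common convention, which is an assumption here because the text under review does not define the metric. The class label strings and the toy predictions are illustrative.

```python
def icbhi_score(y_true, y_pred, normal="normal"):
    """Return (sensitivity, specificity, score) under the usual convention.

    Specificity: fraction of normal cycles predicted as normal.
    Sensitivity: fraction of abnormal cycles predicted as their exact class.
    Score: arithmetic mean of the two.
    """
    n_normal = sum(1 for t in y_true if t == normal)
    n_abnormal = len(y_true) - n_normal
    ok_normal = sum(1 for t, p in zip(y_true, y_pred)
                    if t == normal and p == normal)
    ok_abnormal = sum(1 for t, p in zip(y_true, y_pred)
                      if t != normal and p == t)
    specificity = ok_normal / n_normal
    sensitivity = ok_abnormal / n_abnormal
    return sensitivity, specificity, (sensitivity + specificity) / 2

# Toy example (hypothetical labels and predictions).
y_true = ["normal"] * 4 + ["crackle", "crackle", "wheeze", "both"]
y_pred = ["normal", "normal", "normal", "wheeze",
          "crackle", "wheeze", "wheeze", "normal"]
se, sp, score = icbhi_score(y_true, y_pred)
```

Here three of four normal cycles and two of four abnormal cycles are classified correctly, giving a specificity of 0.75, a sensitivity of 0.5, and a Score of 0.625.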
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our results and methodology. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
- Referee: Abstract: The claim of a new state-of-the-art Score of 66.49% is presented without any numerical values for prior SOTA methods, for the four individual base models, or for a standard ensemble baseline. This omission makes it impossible to determine the size or source of the reported improvement and is load-bearing for the central performance claim.
  Authors: We agree that explicit numerical comparisons are necessary to substantiate the central claim. In the revised manuscript, we will update the abstract to report the scores of prior state-of-the-art methods, the four individual base models, and a standard ensemble baseline. A detailed comparison table will be added to the results section to quantify the magnitude and source of the improvement.
  Revision: yes
- Referee: Abstract and results sections: No ablation is reported that isolates the contribution of each split/granularity combination to the meta-model's OOD performance, nor are statistical significance tests (e.g., paired t-tests or McNemar) supplied for the claimed gains. Without these, the assertion that the four variants produce sufficiently diverse predictions for a generalizable fusion rule cannot be evaluated.
  Authors: We acknowledge that ablations and statistical tests are required to rigorously support the role of split-induced diversity. We will add ablation studies isolating the contribution of each data split and granularity combination to OOD performance. Statistical significance tests, including McNemar's test, will be reported for the performance gains to confirm that the four variants yield sufficiently diverse predictions for effective meta-fusion.
  Revision: yes
- Referee: Abstract: The meta-model architecture, training procedure, and input feature construction are not described. Because the meta-learner is the component that must discover a transferable combination rule, the absence of these details prevents assessment of whether the reported OOD improvement is reproducible or merely an artifact of ICBHI-specific error correlations.
  Authors: We will expand the methods section with a complete description of the meta-model architecture, training procedure, and input feature construction. Key details will be summarized in the abstract to enable readers to evaluate reproducibility and assess whether the OOD gains arise from a generalizable fusion rule rather than dataset-specific correlations.
  Revision: yes
Circularity Check
No circularity: purely empirical meta-ensemble with no derivations or self-referential reductions
The paper describes an empirical pipeline: base classifiers trained on ICBHI under four split/granularity combinations, followed by a meta-model trained on their prediction vectors. No equations, uniqueness theorems, ansatzes, or derivations are presented that could reduce to fitted quantities by construction. Performance claims rest on held-out ICBHI scores and separate OOD evaluations rather than any self-definition or load-bearing self-citation chain. The reader's assessment of score 2.0 is consistent with minor normal self-citation in an ML methods paper; no load-bearing step collapses to the inputs.