Optimising MFCC parameters for the automatic detection of respiratory diseases
Pith reviewed 2026-05-23 21:50 UTC · model grok-4.3
The pith
Optimizing the number of MFCC coefficients, frame length, and hop length raises SVM accuracy for respiratory disease detection by 14.9 to 19.6 percentage points over the worst-case parameter choices tested.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By systematically varying the number of MFCC coefficients, frame length, and hop length on four respiratory sound datasets and feeding the features to an SVM classifier, the study finds that the best parameter combination reaches 81.1 percent accuracy on the Cambridge COVID-19 Sound database, 80.6 percent on Coswara, and 71.7 percent on the SVD dataset. These figures are gains of 19.6, 16.1, and 14.9 percentage points over the worst parameter combinations tested on each dataset. Additional patterns include declining accuracy with increasing hop length, an optimum near 30 coefficients, and opposite frame-length trends between the COVID-19 datasets and the SVD set.
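The abstract reports the gains with a bare % sign, while the referee report further down reads them as percentage points. As a rough arithmetic check, the sketch below derives the worst-case accuracy implied by each reading. The accuracy and gain figures are taken from the text above; the function name and the "relative" reading are assumptions introduced here for illustration only.

```python
# Reported best accuracy (%) and reported gain per dataset, copied from the text above.
reported = {
    "Cambridge COVID-19 Sound": (81.1, 19.6),
    "Coswara": (80.6, 16.1),
    "SVD": (71.7, 14.9),
}

def implied_worst(best, gain):
    """Worst-case accuracy implied by two readings of the reported gain."""
    as_points = best - gain                 # gain read as percentage points
    as_relative = best / (1 + gain / 100)   # gain read as a relative increase
    return round(as_points, 1), round(as_relative, 1)

implied = {name: implied_worst(best, gain) for name, (best, gain) in reported.items()}

for name, (points, relative) in implied.items():
    print(f"{name}: worst = {points}% (points reading) or {relative}% (relative reading)")
```

Under the percentage-point reading the implied worst cases are 61.5, 64.5, and 56.8 percent. The relative reading gives noticeably higher baselines, so which reading is meant changes how large the effect really is.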
What carries the argument
The three MFCC extraction parameters (number of coefficients, frame length, hop length) that control the acoustic feature vectors supplied to the SVM classifier for respiratory condition labeling.
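To make the role of the three parameters concrete, here is a minimal sketch of how they determine the size of the feature matrix handed to the classifier. It assumes frames are taken without padding; the 16 kHz sample rate, the 3 second clip, and the function name are hypothetical choices, not values from the paper.

```python
def mfcc_matrix_shape(duration_s, sample_rate, frame_ms, hop_ms, n_coeffs):
    """Return (n_frames, n_coeffs) for an MFCC matrix computed without padding."""
    n_samples = int(duration_s * sample_rate)
    frame_len = sample_rate * frame_ms // 1000   # frame length in samples
    hop_len = sample_rate * hop_ms // 1000       # hop length in samples
    if n_samples < frame_len:
        return (0, n_coeffs)                     # clip shorter than one frame
    n_frames = 1 + (n_samples - frame_len) // hop_len
    return (n_frames, n_coeffs)

# Hypothetical 3 s clip at 16 kHz with 30 coefficients.
short_frame = mfcc_matrix_shape(3, 16000, frame_ms=50, hop_ms=25, n_coeffs=30)   # (119, 30)
long_frame = mfcc_matrix_shape(3, 16000, frame_ms=500, hop_ms=25, n_coeffs=30)   # (101, 30)
long_hop = mfcc_matrix_shape(3, 16000, frame_ms=50, hop_ms=100, n_coeffs=30)     # (30, 30)
```

In this sketch the hop length has the largest effect on the number of frames, the frame length sets how much signal each frame covers, and the coefficient count sets the width of each frame's feature vector.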
If this is right
- Accuracy falls steadily as hop length grows.
- Roughly 30 coefficients give the highest performance across the tested sets.
- Longer frame lengths hurt results on the two COVID-19 datasets but help on the SVD dataset.
- The best parameter triple outperforms the worst by double-digit percentage points on each dataset.
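The best-versus-worst comparison in the last bullet can be sketched as an exhaustive sweep over parameter triples. The grid values and the `toy_evaluate` function below are placeholders invented for illustration; the toy scorer only mimics the direction of the reported trends and is not a result from the paper. In practice `evaluate` would extract MFCCs, train the SVM, and return a measured accuracy.

```python
from itertools import product

def sweep(evaluate, n_coeffs_grid, frame_ms_grid, hop_ms_grid):
    """Score every (n_coeffs, frame_ms, hop_ms) triple; return best, worst, and their gap."""
    scores = {
        triple: evaluate(*triple)
        for triple in product(n_coeffs_grid, frame_ms_grid, hop_ms_grid)
    }
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, scores[best] - scores[worst]

def toy_evaluate(n_coeffs, frame_ms, hop_ms):
    # Placeholder, NOT a real result: peaks at 30 coefficients, falls as the hop
    # grows, and ignores frame length entirely.
    return 0.8 - 0.002 * abs(n_coeffs - 30) - 0.0005 * hop_ms

best, worst, gap = sweep(toy_evaluate, [13, 20, 30, 40], [50, 100, 500], [10, 25, 50])
print(best, worst, round(gap, 3))
```

Because the gap is the difference between the maximum and minimum over the whole grid, it grows with the number of triples tried even when the parameters have little real effect, which is why the selection protocol matters.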
Where Pith is reading between the lines
- If MFCC parameter choice matters this much, similar tuning sweeps could lift results in other audio-health tasks such as cough or snoring analysis.
- Libraries that ship default MFCC settings may be handing researchers suboptimal features for medical sound work.
- Future studies could test whether the same parameter trends appear when the classifier is swapped for a neural network.
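As an illustration of the point about library defaults, the sketch below converts sample-based defaults into the millisecond units used in the study. The numbers assume librosa's documented defaults at the time of writing (22050 Hz sample rate on load, a 2048-sample FFT window, a 512-sample hop, 20 coefficients); the paper itself does not say which library defaults it has in mind, so treat these as an assumed example.

```python
def samples_to_ms(n_samples, sample_rate):
    """Convert a length in samples to milliseconds."""
    return 1000 * n_samples / sample_rate

# Assumed defaults (librosa): sr=22050, n_fft=2048, hop_length=512, n_mfcc=20.
default_frame_ms = samples_to_ms(2048, 22050)
default_hop_ms = samples_to_ms(512, 22050)
default_n_coeffs = 20

print(f"frame = {default_frame_ms:.1f} ms, hop = {default_hop_ms:.1f} ms, "
      f"coefficients = {default_n_coeffs}")
```

Under these assumptions the default frame is about 93 ms and the default hop about 23 ms, with fewer coefficients than the roughly 30 the study reports as best.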
Load-bearing premise
That the measured accuracy differences come only from the three MFCC parameters and are not produced by unstated choices in preprocessing, normalization, train-test splits, or missing statistical tests.
What would settle it
Re-running the exact same datasets and SVM with the reported best and worst parameter sets under documented cross-validation and error bars; if the accuracy gaps shrink below a few percent or reverse, the central claim does not hold.
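One way to operationalise this test is sketched below: compute the best-minus-worst accuracy gap on each split, then report its mean and standard deviation. The 3-point threshold for "a few percent" and the function name are assumptions, and the example inputs are made-up placeholders rather than measurements.

```python
from statistics import mean, stdev

def gap_summary(best_accs, worst_accs, min_gap=3.0):
    """Summarise per-split accuracy gaps (percentage points) between two parameter sets."""
    gaps = [b - w for b, w in zip(best_accs, worst_accs)]
    gap_mean, gap_sd = mean(gaps), stdev(gaps)
    if gap_mean <= 0:
        verdict = "reversed"   # the 'worst' triple is at least as good on average
    elif gap_mean < min_gap:
        verdict = "shrunk"     # gap is below the assumed 'few percent' threshold
    else:
        verdict = "holds"
    return gap_mean, gap_sd, verdict

# Made-up per-split accuracies (%), for illustration only.
example = gap_summary([80.0, 82.0, 81.0], [62.0, 61.0, 63.0])
print(example)
```

Reporting the standard deviation alongside the mean shows whether the gap is stable across splits or driven by one favourable partition.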
Original abstract
Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCC) is widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. Compared to the worst combination, the SVM model achieves an accuracy of 81.1%, 80.6%, and 71.7%, with improvements of 19.6%, 16.10%, and 14.90% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that MFCC parameters (number of coefficients, frame length, hop length) substantially affect SVM accuracy for respiratory disease detection. It reports directional effects (accuracy falls with larger hop length; ~30 coefficients optimal; frame-length effects differ by dataset) and that an optimized triple yields 19.6 pp, 16.1 pp and 14.9 pp gains (to 81.1 %, 80.6 % and 71.7 %) on the Cambridge COVID-19, Coswara and SVD datasets relative to the worst triple examined.
Significance. If the optimization procedure is shown to be robust, the empirical results across three datasets would supply practical guidance on MFCC tuning for acoustic respiratory biomarkers. The direct measurement on held-out data and the reporting of consistent directional trends are positive features; however, the absence of any validation protocol for selecting the reported optimum currently prevents the quantitative gains from being treated as reliable.
major comments (2)
- [Abstract] Abstract: the headline claim of 19.6/16.1/14.9 pp improvements from an 'optimized combination' is presented without any description of how that combination was identified (held-out validation split, nested CV, grid-search protocol, or multiple-testing correction). This is load-bearing because the largest accuracy observed on a finite grid evaluated on the same data used for selection is expected to be upward-biased, directly undermining the reported gains.
- [Abstract] Abstract / Results: no information is supplied on the train-test split protocol, cross-validation procedure, number of runs, or statistical significance of the accuracy differences. Without these controls it is impossible to determine whether the observed differences arise solely from the three MFCC parameters or are confounded by unstated preprocessing, normalization or split choices.
minor comments (1)
- [Abstract] The abstract mentions a fourth dataset (TACTICAS) but reports quantitative results only for three; a brief statement of its outcome or reason for omission would improve completeness.
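The selection-bias concern in the first major comment can be illustrated with a small simulation. In the sketch below every configuration has the same true accuracy, so any apparent winner is noise. It compares picking and reporting on the same test set against picking on a separate validation set and reporting on the test set. The true accuracy of 0.70, the 200 test items, the 50 configurations, and all function names are hypothetical choices, not figures from the paper.

```python
import random

def observed_acc(true_acc, n_items, rng):
    """Fraction correct on n_items when each item is right with probability true_acc."""
    return sum(rng.random() < true_acc for _ in range(n_items)) / n_items

def one_trial(true_acc, n_items, n_configs, rng):
    """Return (accuracy selected on test, test accuracy of the validation winner)."""
    val = [observed_acc(true_acc, n_items, rng) for _ in range(n_configs)]
    test = [observed_acc(true_acc, n_items, rng) for _ in range(n_configs)]
    picked_on_test = max(test)                 # select and report on the same data
    picked_on_val = test[val.index(max(val))]  # select on validation, report on test
    return picked_on_test, picked_on_val

rng = random.Random(0)
results = [one_trial(0.70, 200, 50, rng) for _ in range(100)]
mean_same_data = sum(r[0] for r in results) / len(results)
mean_held_out = sum(r[1] for r in results) / len(results)
print(f"selected on test: {mean_same_data:.3f}, selected on validation: {mean_held_out:.3f}")
```

In this toy setting the score selected on the test set sits clearly above 0.70 even though no configuration is better than another, while the validation-selected score stays close to 0.70. The same mechanism would inflate a best-versus-worst gap by pushing the maximum up and the minimum down.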
Simulated Author's Rebuttal
We thank the referee for the detailed comments on experimental reporting. We agree that the abstract and results lack necessary protocol details and will revise the manuscript to include them. The directional trends (hop length, coefficient count, dataset-dependent frame length) remain the core contribution and are unaffected by the reporting gaps.
Point-by-point responses
- Referee: [Abstract] Abstract: the headline claim of 19.6/16.1/14.9 pp improvements from an 'optimized combination' is presented without any description of how that combination was identified (held-out validation split, nested CV, grid-search protocol, or multiple-testing correction). This is load-bearing because the largest accuracy observed on a finite grid evaluated on the same data used for selection is expected to be upward-biased, directly undermining the reported gains.
Authors: We agree the selection procedure must be stated explicitly. The optimized triple was obtained by exhaustive enumeration of the three-parameter grid and selection of the combination that maximized accuracy on a fixed held-out test partition (stratified 80/20 split, same partition used for all reported triples). We will revise the abstract to read: 'Using grid search over the MFCC parameter space and evaluation on a held-out test set, the best triple yields accuracies of 81.1 %, 80.6 % and 71.7 % ...' We will also add a short discussion of selection bias and report mean accuracy plus standard deviation across five independent random splits to quantify variability.
revision: yes
- Referee: [Abstract] Abstract / Results: no information is supplied on the train-test split protocol, cross-validation procedure, number of runs, or statistical significance of the accuracy differences. Without these controls it is impossible to determine whether the observed differences arise solely from the three MFCC parameters or are confounded by unstated preprocessing, normalization or split choices.
Authors: We acknowledge the omission. All experiments used a consistent stratified 80/20 train-test split with no data leakage; SVM hyperparameters were tuned via 5-fold cross-validation on the training portion only. We will insert a concise 'Experimental Protocol' paragraph in the methods and results sections stating the split ratio, that the identical split was reused across all MFCC triples, the number of random seeds (five), and paired t-test p-values for the accuracy differences. This will confirm that the reported gains are attributable to the MFCC parameters under controlled conditions.
revision: yes
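The paired t-test over five seeds mentioned in the response can be sketched with the standard library alone. The function below computes the paired t statistic and compares it with the two-sided 5% critical value for 4 degrees of freedom (2.776). The per-seed accuracies are made-up placeholders, not values from the paper, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t, 4 degrees of freedom

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched accuracy lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1

# Made-up accuracies (%) for the best and worst triple on five seeds.
best_runs = [80.0, 81.0, 79.0, 82.0, 80.0]
worst_runs = [62.0, 64.0, 61.0, 63.0, 62.0]
t_stat, dof = paired_t(best_runs, worst_runs)
significant = abs(t_stat) > T_CRIT_DF4
print(f"t = {t_stat:.2f}, df = {dof}, significant at 5%: {significant}")
```

Pairing by seed removes split-to-split variation from the comparison, which fits the stated plan to reuse the identical split across all MFCC triples. It does not by itself correct for having picked the best and worst triples from the same grid.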
Circularity Check
No circularity: empirical grid search over MFCC parameters evaluated directly on datasets
The paper conducts an experimental study: it extracts MFCC features under varying (n_coeffs, frame_len, hop_len) combinations, trains SVM classifiers on four respiratory-disease datasets, and reports measured accuracies. No equations, derivations, or self-referential definitions appear; the reported improvements (19.6 pp, 16.1 pp, 14.9 pp) are direct empirical outcomes on the data, not quantities forced by construction or by self-citation chains. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- number of MFCC coefficients
- frame length
- hop length
axioms (1)
- domain assumption: SVM is an effective and widely adopted classifier for audio-based respiratory classification