
arXiv: 2402.07619 · v2 · submitted 2024-02-12 · cs.SD · cs.AI · eess.AS

Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data

Pith reviewed 2026-05-24 03:51 UTC · model grok-4.3

classification: cs.SD · cs.AI · eess.AS
keywords: COVID-19 detection · voice recordings · deep learning · HuBERT · crowd-sourced data · Mel-spectrograms · speech analysis

The pith

HuBERT identifies COVID-19 from voice recordings at 86% accuracy and 0.93 AUC using crowd-sourced data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds deep learning models that classify COVID-19 status using only voice recordings collected through a mobile app. It processes 893 samples with features such as Mel-spectrograms and MFCCs, then compares LSTM, CNN, and HuBERT architectures against baseline methods. The HuBERT model reaches 86 percent accuracy and 0.93 AUC, the highest among those tested. These outcomes point to voice as a viable signal for scalable COVID-19 identification in the post-pandemic period. A sympathetic reader would see this as a step toward low-cost, non-contact diagnostic tools.

Core claim

The authors develop deep learning models to identify COVID-19 from voice recording data using the Cambridge COVID-19 Sound database of 893 speech samples. Voice features including Mel-spectrograms, MFCC, and CNN Encoder features are extracted and used to train LSTM, CNN, and HuBERT models. HuBERT achieves the highest accuracy of 86% and AUC of 0.93, outperforming other models and suggesting promising results for COVID-19 diagnosis from voice recordings.
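For readers unfamiliar with the feature names: Mel-spectrograms and MFCCs both rest on the mel scale, which spaces frequency bands roughly the way human hearing does. The sketch below is illustrative only and is not taken from the paper. It assumes the common HTK-style conversion formula, and the function names `hz_to_mel`, `mel_to_hz`, and `mel_band_centers` are hypothetical.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mels using the HTK-style formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_bands filters equally spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    # Bands are evenly spaced in mels, so they widen in Hz as frequency rises.
    for center in mel_band_centers(n_bands=5, f_min=0.0, f_max=8000.0):
        print(round(center, 1))
```

A Mel-spectrogram sums short-time spectral energy inside bands centered like these; MFCCs then take a discrete cosine transform of the log band energies. The band count and frequency range used above are placeholders, not the paper's settings.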

What carries the argument

The HuBERT model applied to Mel-spectrograms and MFCC voice features extracted from crowd-sourced recordings for COVID-19 classification.

If this is right

  • Voice-based deep learning models can achieve over 85% accuracy in COVID-19 detection.
  • HuBERT outperforms LSTM and CNN for this classification task.
  • Crowd-sourced voice data supports training of effective prediction models.
  • The method provides a non-invasive and scalable approach to COVID-19 identification.
  • Results compare favorably to state-of-the-art methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such models could be deployed in mobile apps for at-home screening to reduce reliance on clinical tests.
  • The technique may extend to other respiratory illnesses that alter voice characteristics.
  • Independent clinical validation is needed to confirm performance beyond app-reported labels.
  • Combining voice data with additional inputs could further improve multi-variate prediction accuracy.

Load-bearing premise

The labels reported by app users accurately indicate true COVID-19 infection without clinical test verification.
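A back-of-envelope way to see why this premise carries weight, under assumptions that are not in the paper: suppose each test label is flipped with probability ρ independently of the recording, and the model agrees with the true infection status with probability a. The measured accuracy then mixes the two cases in which model and recorded label coincide.

```latex
a_{\text{obs}} = a(1-\rho) + (1-a)\rho = \rho + a(1 - 2\rho),
\qquad
a = \frac{a_{\text{obs}} - \rho}{1 - 2\rho} \quad \text{for } \rho < \tfrac{1}{2}.
```

With a purely hypothetical ρ = 0.05 and the reported a_obs = 0.86, this gives a = 0.81 / 0.90 = 0.90. The independence assumption is the fragile part: if labeling errors correlate with how a voice sounds, the measured figure can be biased in either direction.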

What would settle it

Testing the model on voice recordings from participants with independently confirmed PCR-positive or negative COVID-19 status to check if accuracy stays near 86%.
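A minimal sketch of how such a check could be scored, assuming binary labels (1 = confirmed positive) and one model score per recording. The function names and the toy numbers are illustrative and do not come from the paper. AUC is computed here directly as the fraction of positive/negative pairs that the scores rank correctly, which is simple but quadratic in the number of samples.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only: six recordings with hypothetical PCR labels and scores.
    y_true = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold (0.5 above is an arbitrary placeholder), while AUC does not, so a PCR-confirmed replication would ideally report both.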

Figures

Figures reproduced from arXiv: 2402.07619 by Sami O. Simons, Visara Urovi, Wafaa Aljbawi, Yuyang Yan.

Figure 1: The used pipeline for both traditional Machine learning classifiers and Deep Learning classifiers for COVID-19
Figure 2: Users characteristics (a) age, (b) gender, (c) COVID-19 test results, (d) the number of admissions to hospital
Figure 3: ROC curve for Models
Figure 4: ROC curve for Coswara dataset validation
Figure 5: ROC curve for distinguishing COVID-19 from cold symptoms
Original abstract

COVID-19 has affected more than 223 countries worldwide and in the Post-COVID Era, there is a pressing need for non-invasive, low-cost, and highly scalable solutions to detect COVID-19. We develop a deep learning model to identify COVID-19 from voice recording data. The novelty of this work is in the development of deep learning models for COVID-19 identification from only voice recordings. We use the Cambridge COVID-19 Sound database which contains 893 speech samples, crowd-sourced from 4352 participants via a COVID-19 Sounds app. Voice features including Mel-spectrograms and Mel-frequency cepstral coefficients (MFCC) and CNN Encoder features are extracted. Based on the voice data, we develop deep learning classification models to detect COVID-19 cases. These models include Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) and Hidden-Unit BERT (HuBERT). We compare their predictive power to baseline machine learning models. HuBERT achieves the highest accuracy of 86% and the highest AUC of 0.93. The results achieved with the proposed models suggest promising results in COVID-19 diagnosis from voice recordings when compared to the results obtained from the state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript develops deep learning classifiers (LSTM, CNN, HuBERT) on features including Mel-spectrograms, MFCCs, and CNN encoder outputs extracted from 893 crowd-sourced voice samples in the Cambridge COVID-19 Sound database. It reports that HuBERT attains the highest performance at 86% accuracy and 0.93 AUC for binary COVID-19 classification, outperforming the other models and baselines, and positions the results as promising for non-invasive detection.

Significance. If the reported metrics hold under rigorous evaluation, the work would demonstrate the applicability of self-supervised audio models like HuBERT to respiratory voice data for scalable screening. The scale of the crowd-sourced corpus and direct comparison among LSTM/CNN/HuBERT architectures provide a useful empirical baseline for this task.

major comments (2)
  1. [Methods] Methods section: the abstract and text supply no details on train-test split ratios, cross-validation procedure, class-balance handling, or statistical testing, so the central claim that HuBERT reaches 86% accuracy and 0.93 AUC cannot be verified or reproduced from the given information.
  2. [Data Description] Data section: the 893 samples rely on app-reported binary labels treated as ground truth, with no description of PCR/antigen confirmation, symptom cross-validation, or controls for selection bias and label noise; this assumption is load-bearing for the reported AUC and accuracy figures.
minor comments (1)
  1. [Abstract] Abstract: the statement that results are compared to 'state-of-the-art' lacks explicit citations or tabulated baseline numbers, making the comparative claim difficult to assess.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important omissions in the methods and data sections that affect reproducibility and interpretation. We address each point below and commit to revisions that strengthen the manuscript without altering the core claims or results.

  1. Referee: [Methods] Methods section: the abstract and text supply no details on train-test split ratios, cross-validation procedure, class-balance handling, or statistical testing, so the central claim that HuBERT reaches 86% accuracy and 0.93 AUC cannot be verified or reproduced from the given information.

    Authors: We agree that the original manuscript omitted these critical experimental details. In the revised version we will add a dedicated subsection describing the protocol: an 80/20 train-test split stratified by participant to avoid leakage, 5-fold cross-validation on the training set, class-imbalance handling via weighted cross-entropy loss (weights inversely proportional to class frequencies), and statistical testing (McNemar’s test for accuracy differences and DeLong’s test for AUC comparisons). These additions will allow independent verification of the reported HuBERT performance. revision: yes

  2. Referee: [Data Description] Data section: the 893 samples rely on app-reported binary labels treated as ground truth, with no description of PCR/antigen confirmation, symptom cross-validation, or controls for selection bias and label noise; this assumption is load-bearing for the reported AUC and accuracy figures.

    Authors: The Cambridge COVID-19 Sound database supplies only self-reported labels collected through the mobile app; PCR or antigen confirmation is not available for the majority of samples. We will expand the data section to state this explicitly, cite the original database paper for the collection protocol, and add a limitations paragraph discussing label noise, selection bias, and the absence of clinical confirmation. No additional ground-truth information exists in the released dataset, so we cannot retroactively provide PCR validation. revision: yes
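The protocol promised in the first response can be sketched in a few lines. This is a hedged illustration using only the Python standard library, not the authors' code: `split_by_participant`, `inverse_frequency_weights`, and `mcnemar_exact_p` are hypothetical names, the sample tuples are an assumed layout, and the exact (binomial) form of McNemar's test is one common variant that the rebuttal does not specify.

```python
import math
import random
from collections import Counter


def split_by_participant(samples, test_fraction=0.2, seed=0):
    """Split (participant_id, label, ...) tuples so that every participant's
    recordings land on exactly one side, which prevents speaker leakage.

    The fraction applies to participants, so the sample-level ratio can
    deviate from 80/20. Labels are not balanced across the split here.
    """
    participants = sorted({s[0] for s in samples})
    random.Random(seed).shuffle(participants)
    n_test = max(1, round(len(participants) * test_fraction))
    test_ids = set(participants[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test


def inverse_frequency_weights(labels):
    """Class weights w_c = N / (K * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: test items model A classifies correctly and model B does not.
    c: test items model B classifies correctly and model A does not.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy data: 10 hypothetical participants with 2 recordings each.
    samples = [(f"p{i}", i % 2, j) for i in range(10) for j in range(2)]
    train, test = split_by_participant(samples)
    print(len(train), len(test))
    print(inverse_frequency_weights([0] * 8 + [1] * 2))
    print(mcnemar_exact_p(0, 5))
```

The weights would be passed to whatever weighted cross-entropy loss the training framework provides. DeLong's test for AUC differences is omitted because it needs a covariance estimate that does not fit a short sketch.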

Circularity Check

0 steps flagged

No circularity: empirical ML accuracies on fixed dataset


The paper trains standard classifiers (LSTM, CNN, HuBERT) on Mel-spectrograms/MFCC features from the Cambridge COVID-19 Sound database and reports held-out accuracy/AUC. It contains no equations, no fitted parameters renamed as predictions, no self-citations used as load-bearing uniqueness theorems, and no ansatzes that reduce the reported metrics to the inputs by construction. The evaluation follows conventional supervised learning practice on an external crowd-sourced corpus; the central numbers are not tautological with the training procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that app-provided COVID labels are accurate and that the collected voice samples are free of systematic selection or labeling bias; the paper itself declares no free parameters, axioms, or invented entities.

axioms (1)
  • domain assumption: Crowd-sourced labels from the COVID-19 Sounds app constitute reliable ground truth
    Invoked when training and evaluating all models on the 893 samples

pith-pipeline@v0.9.0 · 5767 in / 1178 out tokens · 24425 ms · 2026-05-24T03:51:23.258390+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Optimising MFCC parameters for the automatic detection of respiratory diseases

    cs.SD 2024-08 conditional novelty 3.0

    Empirical tuning of MFCC parameters (roughly 30 coefficients, shorter hops, dataset-dependent frame lengths) improves SVM accuracy for respiratory disease detection by 14.9-19.6% on COVID-19 and voice-disorder datasets.
