
arxiv: 1907.05905 · v1 · pith:AMZ2FIRRnew · submitted 2019-07-12 · 📡 eess.AS · cs.LG · cs.SD

Voice Pathology Detection Using Deep Learning: a Preliminary Study

Pith reviewed 2026-05-24 21:58 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD
keywords voice pathology detection · deep neural networks · convolutional LSTM · raw audio processing · Saarbruecken Voice Database · sustained vowel · pathology classification · preliminary study

The pith

Convolutional and LSTM layers on raw audio segments detect voice pathologies at 68 percent test accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a deep neural network can identify voice pathologies directly from short segments of sustained vowel recordings. Each recording is cut into 64 ms Hamming-windowed segments with 30 ms overlap and passed through convolutional layers followed by LSTM layers. On 874 test files the model reaches 68.08 percent accuracy, with 66.75 percent sensitivity and 77.89 percent specificity. The authors describe these figures as comparable to a similar previously published experiment that used a different methodology, and read that as evidence that the end-to-end waveform approach is feasible. A sympathetic reader would care because the result opens a route to simpler automated screening tools that do not require hand-crafted acoustic features.
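
The segmentation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code. It assumes that "30 ms overlap" means consecutive windows share 30 ms (a 34 ms hop), that trailing samples which do not fill a window are dropped, and that the sample rate is 50 kHz; none of these details are stated in the text above. The names `hamming_window` and `segment_signal` are hypothetical.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length < 2:
        return [1.0] * length
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def segment_signal(samples, sample_rate, window_ms=64.0, overlap_ms=30.0):
    """Split a 1-D signal into Hamming-windowed segments.

    Assumption: the hop between window starts is window_ms - overlap_ms.
    Trailing samples that do not fill a whole window are dropped.
    """
    win_len = int(round(window_ms * sample_rate / 1000.0))
    hop_len = int(round((window_ms - overlap_ms) * sample_rate / 1000.0))
    if win_len <= 0 or hop_len <= 0:
        raise ValueError("window must be longer than the overlap")
    window = hamming_window(win_len)
    segments = []
    for start in range(0, len(samples) - win_len + 1, hop_len):
        frame = samples[start:start + win_len]
        segments.append([x * w for x, w in zip(frame, window)])
    return segments


# Hypothetical input: one second of a 100 Hz sine at an assumed 50 kHz rate.
SAMPLE_RATE = 50_000
signal = [math.sin(2.0 * math.pi * 100.0 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]
segments = segment_signal(signal, SAMPLE_RATE)
print(len(segments), len(segments[0]))
```

Under these assumptions each segment holds 3200 samples and window starts are 1700 samples apart, so every recording yields many short training examples that all inherit the label of the file they came from.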

Core claim

The central claim is that a network combining convolutional layers and LSTM layers, trained on raw 64 ms Hamming-windowed segments of the sustained vowel /a/, achieves 68.08 percent accuracy, 66.75 percent sensitivity, and 77.89 percent specificity on the held-out test portion of the Saarbruecken Voice Database, and that this performance is comparable to a similar previously published experiment that employed a different methodology.
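
For readers who want the three reported quantities pinned down, here is a small sketch of how accuracy, sensitivity, and specificity follow from confusion-matrix counts. Treating "pathological" as the positive class is an assumption, as is the idea that the metrics are computed per file; the counts in the example are hypothetical and are not taken from the paper. The second function uses the identity accuracy = p · sensitivity + (1 − p) · specificity, where p is the fraction of positive cases, which holds only when all three figures are computed over the same set of items.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    Assumption: the positive class is 'pathological'.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def implied_positive_fraction(accuracy, sensitivity, specificity):
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p.

    Only meaningful if all three figures were computed on the same items.
    """
    if sensitivity == specificity:
        raise ValueError("p is undetermined when sensitivity equals specificity")
    return (specificity - accuracy) / (specificity - sensitivity)


# Hypothetical counts, chosen only to illustrate the definitions.
metrics = binary_metrics(tp=60, fn=40, tn=80, fp=20)
print(metrics)

# Class mix implied by the reported test-set figures, under the stated assumption.
p_test = implied_positive_fraction(0.6808, 0.6675, 0.7789)
print(round(p_test, 3))
```

Because accuracy is a prevalence-weighted mix of the other two figures, it cannot be interpreted without knowing the class balance of the evaluation set.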

What carries the argument

The convolutional-recurrent (CNN-LSTM) architecture applied directly to overlapping raw-audio segments.

If this is right

  • Raw waveform input can substitute for manual feature extraction in voice pathology classification.
  • The reported accuracy level is comparable to earlier experiments that used alternative processing methods.
  • Additional data or architectural tuning could raise performance toward state-of-the-art levels.
  • The same segmentation and training procedure can be applied to the validation split of the same corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identical segmentation and network might be tested on continuous speech or other vowels to check broader applicability.
  • Feeding the available electroglottograph signals alongside the audio could improve accuracy without redesigning the core model.
  • Evaluating the model on an external corpus would reveal whether the learned patterns transfer beyond the original database.

Load-bearing premise

The pathology labels supplied with the Saarbruecken Voice Database are correct and the chosen short segments retain the diagnostic information without artifacts or loss of longer-term cues.

What would settle it

Running the identical trained model on an independent set of voice recordings whose healthy or pathological status has been verified by separate clinical examination; accuracy falling to chance level would falsify the claim.

Figures

Figures reproduced from arXiv: 1907.05905 by Jesus B. Alonso-Hernandez, Jiri Mekyska, Pavol Harar, Radim Burget, Zdenek Smekal, Zoltan Galaz.

Figure 1: Detailed DNN architecture. All layers were initialized using Glorot uniform initialization [21]. The whole DNN had 428 772 trainable parameters overall.
read the original abstract

This paper describes a preliminary investigation of Voice Pathology Detection using Deep Neural Networks (DNN). We used voice recordings of sustained vowel /a/ produced at normal pitch from German corpus Saarbruecken Voice Database (SVD). This corpus contains voice recordings and electroglottograph signals of more than 2 000 speakers. The idea behind this experiment is the use of convolutional layers in combination with recurrent Long-Short-Term-Memory (LSTM) layers on raw audio signal. Each recording was split into 64 ms Hamming windowed segments with 30 ms overlap. Our trained model achieved 71.36% accuracy with 65.04% sensitivity and 77.67% specificity on 206 validation files and 68.08% accuracy with 66.75% sensitivity and 77.89% specificity on 874 testing files. This is a promising result in favor of this approach because it is comparable to similar previously published experiment that used different methodology. Further investigation is needed to achieve the state-of-the-art results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents a preliminary study using a CNN+LSTM model on raw 64 ms Hamming-windowed segments (30 ms overlap) of sustained /a/ vowels from the Saarbruecken Voice Database to detect voice pathology. It reports 71.36% accuracy (65.04% sensitivity, 77.67% specificity) on 206 validation files and 68.08% accuracy (66.75% sensitivity, 77.89% specificity) on 874 test files, claiming the result is promising and comparable to prior work.

Significance. If the reported accuracies are obtained under speaker-independent partitioning and with standard controls for imbalance and variance, the work would supply a modest baseline for end-to-end waveform modeling in voice pathology detection; the numbers themselves are not state-of-the-art but could motivate further investigation of short-segment CNN+LSTM pipelines.

major comments (3)
  1. [Abstract / Methods] Abstract / Methods paragraph: the data split into training, 206-file validation, and 874-file test sets is not stated to be speaker-disjoint. Because SVD contains multiple recordings per speaker, speaker overlap would permit the model to exploit stable speaker traits rather than pathology cues within the 64 ms segments, rendering the headline accuracies non-diagnostic for the central claim.
  2. [Abstract] Abstract: no training protocol, optimizer, learning-rate schedule, class-imbalance handling, or regularization details are supplied, nor are error bars or significance tests reported for the accuracy/sensitivity/specificity figures on the two held-out sets.
  3. [Abstract] Abstract: the choice of 64 ms segments with 30 ms overlap is presented without ablation or justification that longer-term cues are unnecessary or that the windowing does not introduce artifacts that affect pathology discrimination.
minor comments (1)
  1. [Abstract] Abstract: the claim of comparability to 'similar previously published experiment' lacks a citation or quantitative table comparing metrics and methodology.
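
The speaker-disjoint partitioning raised in major comment 1 can be illustrated with a short sketch. It is not the procedure used in the paper; it only shows one way to guarantee that every recording of a given speaker lands in exactly one split. The function name, the split fractions, and the `(speaker_id, file_name)` record layout are all hypothetical.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(recordings, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole speakers (not individual files) to train/val/test.

    recordings: iterable of (speaker_id, file_name) pairs (hypothetical layout).
    fractions: approximate share of speakers per split (illustrative values).
    """
    by_speaker = defaultdict(list)
    for speaker_id, file_name in recordings:
        by_speaker[speaker_id].append(file_name)

    speakers = sorted(by_speaker)            # deterministic order before shuffling
    random.Random(seed).shuffle(speakers)

    n_train = round(fractions[0] * len(speakers))
    n_val = round(fractions[1] * len(speakers))
    groups = {
        "train": speakers[:n_train],
        "val": speakers[n_train:n_train + n_val],
        "test": speakers[n_train + n_val:],
    }
    return {name: [(s, f) for s in ids for f in by_speaker[s]]
            for name, ids in groups.items()}


# Hypothetical corpus: 20 speakers with one to three recordings each.
recordings = [(f"spk{i:02d}", f"spk{i:02d}_rec{r}.wav")
              for i in range(20) for r in range(1 + i % 3)]
splits = speaker_disjoint_split(recordings)
speaker_sets = {name: {s for s, _ in items} for name, items in splits.items()}
print({name: len(items) for name, items in splits.items()})
```

Splitting at the speaker level means file counts per split are only approximately controlled, which is the usual price for removing speaker leakage.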

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our preliminary study. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Methods] Abstract / Methods paragraph: the data split into training, 206-file validation, and 874-file test sets is not stated to be speaker-disjoint. Because SVD contains multiple recordings per speaker, speaker overlap would permit the model to exploit stable speaker traits rather than pathology cues within the 64 ms segments, rendering the headline accuracies non-diagnostic for the central claim.

    Authors: We acknowledge that the manuscript does not state the split is speaker-disjoint. The partitioning was performed at the file level without speaker independence constraints. We will revise the Methods section to explicitly describe the splitting procedure, note the limitation that speaker-specific traits may influence results, and clarify that the study is an initial exploration of the CNN+LSTM pipeline on raw audio rather than a speaker-independent diagnostic benchmark. revision: yes

  2. Referee: [Abstract] Abstract: no training protocol, optimizer, learning-rate schedule, class-imbalance handling, or regularization details are supplied, nor are error bars or significance tests reported for the accuracy/sensitivity/specificity figures on the two held-out sets.

    Authors: The preliminary nature of the work omitted these details. We will expand the Methods section to document the optimizer, learning-rate schedule, batch size, epochs, class-imbalance handling (if any), and regularization. Error bars and significance tests were not computed originally; we will add an explicit statement acknowledging this as a limitation of the reported figures. revision: partial

  3. Referee: [Abstract] Abstract: the choice of 64 ms segments with 30 ms overlap is presented without ablation or justification that longer-term cues are unnecessary or that the windowing does not introduce artifacts that affect pathology discrimination.

    Authors: The 64 ms window with 30 ms overlap follows standard short-time speech processing to enable local feature extraction by the CNN while permitting the LSTM to model temporal structure. We will add a brief justification in Methods citing prior voice pathology literature on frame lengths. No ablation on segment duration was performed; we will note this explicitly as future work rather than asserting optimality. revision: yes
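
The missing error bars discussed in the second exchange can be approximated after the fact. The sketch below computes a Wilson score interval for a proportion, which is a standard choice for accuracy on a finite test set. It assumes test files are independent draws, which speaker overlap would violate, and it infers the count of 595 correct files from 68.08 percent of 874; that count is a rounding-based inference, not a number reported in the paper.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2.0 * total)) / denom
    half_width = (z * math.sqrt(p_hat * (1.0 - p_hat) / total
                                + z * z / (4.0 * total * total))) / denom
    return center - half_width, center + half_width


# Inferred, not reported: 0.6808 * 874 is approximately 595 correct files.
low, high = wilson_interval(595, 874)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 65 to 71 percent, so differences of a few points against prior work should be read with that width in mind.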

Circularity Check

0 steps flagged

No circularity: empirical ML results with no derivation chain

full rationale

This is a standard empirical machine-learning paper that trains a CNN+LSTM model on 64 ms audio segments from the Saarbruecken Voice Database and reports classification accuracies on held-out validation (206 files) and test (874 files) sets. No mathematical derivation, first-principles prediction, or parameter-fitting step is presented as an output that reduces to its own inputs. The reported metrics (71.36% validation accuracy, 68.08% test accuracy) are direct evaluation results, not predictions derived by construction from fitted quantities. No self-citations, uniqueness theorems, or ansatzes are invoked to justify any load-bearing claim. The experiment is therefore self-contained against external benchmarks and receives score 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the correctness of the SVD pathology labels and on the unstated assumption that the chosen segment length and overlap are sufficient to capture diagnostic cues. No free parameters are explicitly fitted in the abstract, but the model itself contains the usual deep-learning hyperparameters whose values are not reported.

free parameters (1)
  • segment length and overlap
    64 ms Hamming window with 30 ms overlap chosen without reported justification or ablation; affects every input example.
axioms (1)
  • domain assumption: SVD pathology labels are ground truth
    The experiment treats the database labels as correct for training and evaluation.



Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 2 internal anchors

  1. [1]

    Robust and complex approach of pathological speech signal analysis,

    J. Mekyska, E. Janousova, P. Gomez-Vilda, Z. Smekal, I. Rektorova, I. Eliasova, M. Kostalova, M. Mrackova, J. B. Alonso-Hernandez, M. Faundez-Zanuy et al., “Robust and complex approach of pathological speech signal analysis,” Neurocomputing, vol. 167, pp. 94–111, 2015

  2. [2]

    Voice pathology detection using interlaced derivative pattern on glottal source excitation,

    G. Muhammad, M. Alsulaiman, Z. Ali, T. A. Mesallam, M. Farahat, K. H. Malki, A. Al-nasheri, and M. A. Bencherif, “Voice pathology detection using interlaced derivative pattern on glottal source excitation,” Biomedical Signal Processing and Control, vol. 31, pp. 156–164, 2017

  3. [3]

    Saarbruecken voice database,

    B. Woldert-Jokisz, “Saarbruecken voice database,” 2007

  4. [4]

    Gradient-based learning applied to document recognition,

    Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998

  5. [5]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

  6. [6]

    Voice pathology detection using auto-correlation of different filters bank,

    A. Al-nasheri, Z. Ali, G. Muhammad, and M. Alsulaiman, “Voice pathology detection using auto-correlation of different filters bank,” in Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on. IEEE, 2014, pp. 50–55

  7. [7]

    Voice pathology detection on the Saarbrücken voice database with calibration and fusion of scores using multifocal toolkit,

    D. Martínez, E. Lleida, A. Ortega, A. Miguel, and J. Villalba, “Voice pathology detection on the Saarbrücken voice database with calibration and fusion of scores using multifocal toolkit,” in Advances in Speech and Language Technologies for Iberian Languages. Springer, 2012, pp. 99–109

  8. [8]

    Dimensionality reduction for voice disorders identification system based on mel frequency cepstral coefficients and support vector machine,

    N. Souissi and A. Cherif, “Dimensionality reduction for voice disorders identification system based on mel frequency cepstral coefficients and support vector machine,” in Modelling, Identification and Control (ICMIC), 2015 7th International Conference on. IEEE, 2015, pp. 1–6

  9. [9]

    Speech recognition system based on short-term cepstral parameters, feature reduction method and artificial neural networks,

    ——, “Speech recognition system based on short-term cepstral parameters, feature reduction method and artificial neural networks,” in Advanced Technologies for Signal and Image Processing (ATSIP), 2016 2nd International Conference on. IEEE, 2016, pp. 667–671

  10. [10]

    Investigation of voice pathology detection and classification on different frequency regions using correlation functions,

    A. Al-nasheri, G. Muhammad, M. Alsulaiman, and Z. Ali, “Investigation of voice pathology detection and classification on different frequency regions using correlation functions,” Journal of Voice, vol. 31, no. 1, pp. 3–15, 2017

  11. [11]

    Healthcare big data voice pathology assessment framework,

    M. S. Hossain and G. Muhammad, “Healthcare big data voice pathology assessment framework,” IEEE Access, vol. 4, pp. 7806–7815, 2016

  12. [12]

    Voice disorder classification based on multitaper mel frequency cepstral coefficients features,

    Ö. Eskidere and A. Gürhanlı, “Voice disorder classification based on multitaper mel frequency cepstral coefficients features,” Computational and mathematical methods in medicine, vol. 2015, 2015

  13. [13]

    An investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification,

    A. Al-nasheri, G. Muhammad, M. Alsulaiman, Z. Ali, T. A. Mesallam, M. Farahat, K. H. Malki, and M. A. Bencherif, “An investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification,” Journal of Voice, vol. 31, no. 1, pp. 113–e9, 2017

  14. [14]

    Enhanced living by assessing voice pathology using a co-occurrence matrix,

    G. Muhammad, M. F. Alhamid, M. S. Hossain, A. S. Almogren, and A. V. Vasilakos, “Enhanced living by assessing voice pathology using a co-occurrence matrix,” Sensors, vol. 17, no. 2, p. 267, 2017

  15. [15]

    Voice data mining for laryngeal pathology assessment,

    D. Hemmerling, A. Skalski, and J. Gajda, “Voice data mining for laryngeal pathology assessment,” Computers in biology and medicine, vol. 69, pp. 270–276, 2016

  16. [16]

    Improving neural networks by preventing co-adaptation of feature detectors

    G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012

  17. [17]

    Keras: Deep learning library for theano and tensorflow,

    F. Chollet et al., “Keras: Deep learning library for theano and tensorflow,” URL: https://keras.io/, 2015

  18. [18]

    Dropout: a simple way to prevent neural networks from overfitting

    N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014

  19. [19]

    Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,

    J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing. Springer, 1990, pp. 227–236

  20. [20]

    Rectified linear units improve restricted boltzmann machines,

    V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814

  21. [21]

    Understanding the difficulty of training deep feedforward neural networks

    X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Aistats, vol. 9, 2010, pp. 249–256

  22. [22]

    Adam: A Method for Stochastic Optimization

    D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014