Voice Pathology Detection Using Deep Learning: a Preliminary Study

Jesus B. Alonso-Hernandez; Jiri Mekyska; Pavol Harar; Radim Burget; Zdenek Smekal; Zoltan Galaz

arxiv: 1907.05905 · v1 · pith:AMZ2FIRRnew · submitted 2019-07-12 · eess.AS · cs.LG · cs.SD

Pith reviewed 2026-05-24 21:58 UTC · model grok-4.3

keywords: voice pathology detection · deep neural networks · convolutional LSTM · raw audio processing · Saarbruecken Voice Database · sustained vowel · pathology classification · preliminary study

The pith

Convolutional and LSTM layers on raw audio segments detect voice pathologies at 68 percent test accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a deep neural network can identify voice pathologies directly from short segments of sustained vowel recordings. Each recording is cut into 64 ms Hamming-windowed segments with 30 ms overlap and passed through convolutional layers followed by LSTM layers. On 874 test files the model reaches 68.08 percent accuracy, with 66.75 percent sensitivity and 77.89 percent specificity. The authors describe these figures as comparable to a previously published experiment that used different methodology, which suggests that the end-to-end waveform approach is feasible. A sympathetic reader would care because the result opens a route to simpler automated screening tools that do not require hand-crafted acoustic features.
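The segmentation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the 50 kHz sample rate, the function names `hamming` and `segment`, the synthetic test tone, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def segment(signal, sample_rate, win_ms=64.0, overlap_ms=30.0):
    """Split a 1-D signal into Hamming-windowed frames.

    Consecutive frames share overlap_ms of audio, so the hop between
    frame starts is win_ms - overlap_ms. Trailing samples that do not
    fill a whole frame are dropped (an assumption of this sketch).
    """
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * (win_ms - overlap_ms) / 1000.0))
    window = hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        chunk = signal[start:start + win]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# Hypothetical input: one second of a 100 Hz sine at an assumed 50 kHz rate.
SAMPLE_RATE = 50_000
tone = [math.sin(2.0 * math.pi * 100.0 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = segment(tone, SAMPLE_RATE)
```

With these assumed settings each frame holds 3200 samples and successive frames start 1700 samples (34 ms) apart, so every input example covers only a handful of glottal cycles.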

Core claim

The central claim is that a network combining convolutional layers and LSTM layers, trained on raw 64 ms Hamming-windowed segments of the sustained vowel /a/, achieves 68.08 percent accuracy, 66.75 percent sensitivity, and 77.89 percent specificity on the held-out test portion of the Saarbruecken Voice Database and that this performance is comparable to previously published experiments that employed different methodology.
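The three figures in this claim follow the standard definitions over a binary confusion matrix. The sketch below shows those definitions; it assumes that "pathological" is the positive class, and the counts passed in are hypothetical, not taken from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: pathological examples classified correctly/incorrectly.
    tn/fp: healthy examples classified correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }

# Hypothetical counts chosen only to illustrate the arithmetic.
m = binary_metrics(tp=70, fn=30, tn=80, fp=20)
```

Sensitivity and specificity are each computed within one class, so neither depends on how many healthy versus pathological examples the evaluation set contains; accuracy does.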

What carries the argument

The convolutional-recurrent (CNN-LSTM) architecture applied directly to overlapping raw-audio segments.

If this is right

Raw waveform input can substitute for manual feature extraction in voice pathology classification.
The reported accuracy level is comparable to earlier experiments that used alternative processing methods.
Additional data or architectural tuning could raise performance toward state-of-the-art levels.
The same segmentation and training procedure can be applied to the validation split of the same corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The identical segmentation and network might be tested on continuous speech or other vowels to check broader applicability.
Feeding the available electroglottograph signals alongside the audio could improve accuracy without redesigning the core model.
Evaluating the model on an external corpus would reveal whether the learned patterns transfer beyond the original database.

Load-bearing premise

The pathology labels supplied with the Saarbruecken Voice Database are correct and the chosen short segments retain the diagnostic information without artifacts or loss of longer-term cues.

What would settle it

Running the identical trained model on an independent set of voice recordings whose healthy or pathological status has been verified by separate clinical examination; accuracy falling to chance level would falsify the claim.

Figures

Figures reproduced from arXiv: 1907.05905 by Jesus B. Alonso-Hernandez, Jiri Mekyska, Pavol Harar, Radim Burget, Zdenek Smekal, Zoltan Galaz.

**Figure 1.** Detailed DNN architecture. All layers were initialized using Glorot uniform initialization [21]. The whole DNN has 428 772 trainable parameters. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]

Original abstract

This paper describes a preliminary investigation of Voice Pathology Detection using Deep Neural Networks (DNN). We used voice recordings of sustained vowel /a/ produced at normal pitch from German corpus Saarbruecken Voice Database (SVD). This corpus contains voice recordings and electroglottograph signals of more than 2 000 speakers. The idea behind this experiment is the use of convolutional layers in combination with recurrent Long-Short-Term-Memory (LSTM) layers on raw audio signal. Each recording was split into 64 ms Hamming windowed segments with 30 ms overlap. Our trained model achieved 71.36% accuracy with 65.04% sensitivity and 77.67% specificity on 206 validation files and 68.08% accuracy with 66.75% sensitivity and 77.89% specificity on 874 testing files. This is a promising result in favor of this approach because it is comparable to similar previously published experiment that used different methodology. Further investigation is needed to achieve the state-of-the-art results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a small preliminary CNN-LSTM run on 64 ms raw segments from SVD sustained vowels that hits 68% test accuracy but provides no evidence the splits are speaker-disjoint.

The paper tests a CNN followed by LSTM layers on short Hamming-windowed chunks of raw audio from the Saarbruecken Voice Database. The authors split each sustained /a/ recording into 64 ms segments with 30 ms overlap, train the model, and report 71.36% accuracy on 206 validation files and 68.08% on 874 test files, along with the corresponding sensitivity and specificity figures. Those concrete numbers on explicit file counts are the main thing the work delivers cleanly.

Referee Report

3 major / 1 minor

Summary. The paper presents a preliminary study using a CNN+LSTM model on raw 64 ms Hamming-windowed segments (30 ms overlap) of sustained /a/ vowels from the Saarbruecken Voice Database to detect voice pathology. It reports 71.36% accuracy (65.04% sensitivity, 77.67% specificity) on 206 validation files and 68.08% accuracy (66.75% sensitivity, 77.89% specificity) on 874 test files, claiming the result is promising and comparable to prior work.

Significance. If the reported accuracies are obtained under speaker-independent partitioning and with standard controls for imbalance and variance, the work would supply a modest baseline for end-to-end waveform modeling in voice pathology detection; the numbers themselves are not state-of-the-art but could motivate further investigation of short-segment CNN+LSTM pipelines.
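Speaker-independent partitioning, as referred to above, means assigning whole speakers rather than individual files to each split. The following is a minimal sketch of that idea; the function name `speaker_disjoint_split`, the file and speaker identifiers, and the 20 percent test fraction are hypothetical and do not describe the procedure used in the paper.

```python
import random
from collections import defaultdict

def speaker_disjoint_split(files, test_fraction=0.2, seed=0):
    """Split (file_id, speaker_id) pairs so no speaker appears in both sets.

    Returns (train_files, test_files) as lists of file_ids.
    """
    by_speaker = defaultdict(list)
    for file_id, speaker_id in files:
        by_speaker[speaker_id].append(file_id)

    # Shuffle speakers, not files, so all recordings of a speaker stay together.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))

    test = [f for s in speakers[:n_test] for f in by_speaker[s]]
    train = [f for s in speakers[n_test:] for f in by_speaker[s]]
    return train, test

# Hypothetical corpus: 10 speakers with 2 recordings each.
files = [(f"rec{i}", f"spk{i // 2}") for i in range(20)]
train_files, test_files = speaker_disjoint_split(files)
```

Splitting at the speaker level keeps stable speaker traits out of the held-out set, at the cost of split sizes that vary with how many recordings each selected speaker contributed.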

major comments (3)

[Abstract / Methods] The data split into training, 206-file validation, and 874-file test sets is not stated to be speaker-disjoint. Because SVD contains multiple recordings per speaker, speaker overlap would permit the model to exploit stable speaker traits rather than pathology cues within the 64 ms segments, rendering the headline accuracies non-diagnostic for the central claim.
[Abstract] No training protocol, optimizer, learning-rate schedule, class-imbalance handling, or regularization details are supplied, nor are error bars or significance tests reported for the accuracy/sensitivity/specificity figures on the two held-out sets.
[Abstract] The choice of 64 ms segments with 30 ms overlap is presented without ablation or justification that longer-term cues are unnecessary or that the windowing does not introduce artifacts that affect pathology discrimination.
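The class-imbalance question can be probed from the reported figures alone. When accuracy, sensitivity and specificity are computed over the same set of examples, accuracy = p · sensitivity + (1 − p) · specificity, where p is the fraction of positive (pathological) examples. The sketch below solves that identity for p. It assumes all three numbers refer to the same evaluation unit (files or segments), which the abstract does not confirm, and it inherits the rounding error of the published percentages; the function name is hypothetical.

```python
def implied_positive_fraction(accuracy, sensitivity, specificity):
    """Solve accuracy = p*sensitivity + (1 - p)*specificity for p."""
    if sensitivity == specificity:
        raise ValueError("p is not identifiable when sensitivity equals specificity")
    return (specificity - accuracy) / (specificity - sensitivity)

# Percentages as reported in the abstract.
p_val = implied_positive_fraction(71.36, 65.04, 77.67)   # 206 validation files
p_test = implied_positive_fraction(68.08, 66.75, 77.89)  # 874 test files
```

Under the stated assumption, the validation figures imply roughly balanced classes, while the test figures imply a clear majority of pathological examples. That gap is one reason explicit class counts for each split would help in interpreting the two accuracies.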

minor comments (1)

[Abstract] The claim of comparability to a 'similar previously published experiment' lacks a citation or a quantitative table comparing metrics and methodology.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our preliminary study. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [Abstract / Methods] The data split into training, 206-file validation, and 874-file test sets is not stated to be speaker-disjoint. Because SVD contains multiple recordings per speaker, speaker overlap would permit the model to exploit stable speaker traits rather than pathology cues within the 64 ms segments, rendering the headline accuracies non-diagnostic for the central claim.

Authors: We acknowledge that the manuscript does not state the split is speaker-disjoint. The partitioning was performed at the file level without speaker independence constraints. We will revise the Methods section to explicitly describe the splitting procedure, note the limitation that speaker-specific traits may influence results, and clarify that the study is an initial exploration of the CNN+LSTM pipeline on raw audio rather than a speaker-independent diagnostic benchmark.

revision: yes

Referee: [Abstract] No training protocol, optimizer, learning-rate schedule, class-imbalance handling, or regularization details are supplied, nor are error bars or significance tests reported for the accuracy/sensitivity/specificity figures on the two held-out sets.

Authors: The preliminary nature of the work omitted these details. We will expand the Methods section to document the optimizer, learning-rate schedule, batch size, epochs, class-imbalance handling (if any), and regularization. Error bars and significance tests were not computed originally; we will add an explicit statement acknowledging this as a limitation of the reported figures.

revision: partial

Referee: [Abstract] The choice of 64 ms segments with 30 ms overlap is presented without ablation or justification that longer-term cues are unnecessary or that the windowing does not introduce artifacts that affect pathology discrimination.

Authors: The 64 ms window with 30 ms overlap follows standard short-time speech processing to enable local feature extraction by the CNN while permitting the LSTM to model temporal structure. We will add a brief justification in Methods citing prior voice pathology literature on frame lengths. No ablation on segment duration was performed; we will note this explicitly as future work rather than asserting optimality.

revision: yes
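The missing error bars discussed in the exchange above can be approximated from the reported accuracy and file count alone. The sketch below uses the Wilson score interval for a binomial proportion. It assumes the accuracies are file-level and that files are independent trials, which would not hold if the same speaker appears more than once; the function name and the 95 percent level (z = 1.96) are choices made for the example.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion p_hat observed on n trials."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Reported accuracies and file counts from the abstract.
val_lo, val_hi = wilson_interval(0.7136, 206)
test_lo, test_hi = wilson_interval(0.6808, 874)
```

Because the validation set is much smaller than the test set, its interval is noticeably wider, so under these assumptions the difference between the two reported accuracies is within what sampling variation could produce.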

Circularity Check

0 steps flagged

No circularity: empirical ML results with no derivation chain

This is a standard empirical machine-learning paper that trains a CNN+LSTM model on 64 ms audio segments from the Saarbruecken Voice Database and reports classification accuracies on held-out validation (206 files) and test (874 files) sets. No mathematical derivation, first-principles prediction, or parameter-fitting step is presented as an output that reduces to its own inputs. The reported metrics (71.36% validation accuracy, 68.08% test accuracy) are direct evaluation results, not predictions derived by construction from fitted quantities. No self-citations, uniqueness theorems, or ansatzes are invoked to justify any load-bearing claim. The experiment is therefore self-contained against external benchmarks and receives score 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the correctness of the SVD pathology labels and on the unstated assumption that the chosen segment length and overlap are sufficient to capture diagnostic cues. No free parameters are explicitly fitted in the abstract, but the model itself contains the usual deep-learning hyperparameters whose values are not reported.

free parameters (1)

segment length and overlap
64 ms Hamming window with 30 ms overlap chosen without reported justification or ablation; affects every input example.

axioms (1)

domain assumption: SVD pathology labels are ground truth
The experiment treats the database labels as correct for training and evaluation.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 2 internal anchors

[1] J. Mekyska, E. Janousova, P. Gomez-Vilda, Z. Smekal, I. Rektorova, I. Eliasova, M. Kostalova, M. Mrackova, J. B. Alonso-Hernandez, M. Faundez-Zanuy et al., “Robust and complex approach of pathological speech signal analysis,” Neurocomputing, vol. 167, pp. 94–111, 2015

[2] G. Muhammad, M. Alsulaiman, Z. Ali, T. A. Mesallam, M. Farahat, K. H. Malki, A. Al-nasheri, and M. A. Bencherif, “Voice pathology detection using interlaced derivative pattern on glottal source excitation,” Biomedical Signal Processing and Control, vol. 31, pp. 156–164, 2017

[3] B. Woldert-Jokisz, “Saarbruecken voice database,” 2007

[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998

[5] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

[6] A. Al-nasheri, Z. Ali, G. Muhammad, and M. Alsulaiman, “Voice pathology detection using auto-correlation of different filters bank,” in Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on. IEEE, 2014, pp. 50–55

[7] D. Martínez, E. Lleida, A. Ortega, A. Miguel, and J. Villalba, “Voice pathology detection on the saarbrücken voice database with calibration and fusion of scores using multifocal toolkit,” in Advances in Speech and Language Technologies for Iberian Languages. Springer, 2012, pp. 99–109

[8] N. Souissi and A. Cherif, “Dimensionality reduction for voice disorders identification system based on mel frequency cepstral coefficients and support vector machine,” in Modelling, Identification and Control (ICMIC), 2015 7th International Conference on. IEEE, 2015, pp. 1–6

[9] N. Souissi and A. Cherif, “Speech recognition system based on short-term cepstral parameters, feature reduction method and artificial neural networks,” in Advanced Technologies for Signal and Image Processing (ATSIP), 2016 2nd International Conference on. IEEE, 2016, pp. 667–671

[10] A. Al-nasheri, G. Muhammad, M. Alsulaiman, and Z. Ali, “Investigation of voice pathology detection and classification on different frequency regions using correlation functions,” Journal of Voice, vol. 31, no. 1, pp. 3–15, 2017

[11] M. S. Hossain and G. Muhammad, “Healthcare big data voice pathology assessment framework,” IEEE Access, vol. 4, pp. 7806–7815, 2016

[12] Ö. Eskidere and A. Gürhanlı, “Voice disorder classification based on multitaper mel frequency cepstral coefficients features,” Computational and mathematical methods in medicine, vol. 2015, 2015

[13] A. Al-nasheri, G. Muhammad, M. Alsulaiman, Z. Ali, T. A. Mesallam, M. Farahat, K. H. Malki, and M. A. Bencherif, “An investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification,” Journal of Voice, vol. 31, no. 1, pp. 113–e9, 2017

[14] G. Muhammad, M. F. Alhamid, M. S. Hossain, A. S. Almogren, and A. V. Vasilakos, “Enhanced living by assessing voice pathology using a co-occurrence matrix,” Sensors, vol. 17, no. 2, p. 267, 2017

[15] D. Hemmerling, A. Skalski, and J. Gajda, “Voice data mining for laryngeal pathology assessment,” Computers in biology and medicine, vol. 69, pp. 270–276, 2016

[16] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012

[17] F. Chollet et al., “Keras: Deep learning library for theano and tensorflow,” URL: https://keras.io/, 2015

[18] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014

[19] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing. Springer, 1990, pp. 227–236

[20] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814

[21] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Aistats, vol. 9, 2010, pp. 249–256

[22] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
