Interpretable Deep Learning Model for the Detection and Reconstruction of Dysarthric Speech

Bozena Kostek; Daniel Korzekwa; Mateusz Lajszczak; Roberto Barra-Chicote; Thomas Drugman

REVIEW · 3 major objections · 1 minor · 44 references

An encoder-decoder model factorizes dysarthric speech into a low-dimensional latent space that captures intelligibility and fluency for improved detection and reconstruction.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-24 23:21 UTC pith:7DMEIYN5

Load-bearing objection: The paper introduces a text-conditioned autoencoder with multi-task dysarthria detection on the latent space, but the key accuracy gain is attributed to the low-dimensional representation without showing a direct high-dimensional baseline.

arXiv 1907.04743 v1 · pith:7DMEIYN5 · submitted 2019-07-10 · eess.AS cs.CL cs.SD


Daniel Korzekwa, Roberto Barra-Chicote, Bozena Kostek, Thomas Drugman, Mateusz Lajszczak

classification eess.AS cs.CL cs.SD

keywords dysarthric speech · speech detection · speech reconstruction · latent space · multi-task learning · auto-encoder · fluency adaptation · MUSHRA test

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that speech can be factorized by an encoder-decoder into a compact latent representation plus text encoding, where the latent part carries measurable traits of dysarthria. A multi-task setup that predicts both dysarthria probability and the mel-spectrogram from this space yields higher detection accuracy than direct prediction from high-dimensional spectrograms. Adapting the latent variables then produces output speech rated higher in fluency by listeners in a MUSHRA test. A sympathetic reader would care because current dysarthria tools often treat detection and modification separately and lack interpretable controls.

Core claim

The encoder-decoder model factorizes speech into a low-dimensional latent space and encoding of the input text. The latent space conveys interpretable characteristics of dysarthria such as intelligibility and fluency of speech. MUSHRA perceptual test demonstrated that the adaptation of the latent space let the model generate speech of improved fluency. The multi-task supervised approach for predicting both the probability of dysarthric speech and the mel-spectrogram helps improve the detection of dysarthria with higher accuracy thanks to a low-dimensional latent space of the auto-encoder as opposed to directly predicting dysarthria from a highly dimensional mel-spectrogram.
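The multi-task setup in this claim can be pictured as a single objective that sums a reconstruction term and a detection term. The sketch below is a minimal illustration under stated assumptions: mean squared error for the mel-spectrogram, binary cross-entropy for the dysarthria probability, and a hypothetical `weight` that balances them. None of these choices is confirmed by this page as the paper's actual loss.

```python
import math


def multitask_loss(mel_true, mel_pred, label, prob, weight=1.0, eps=1e-7):
    """Reconstruction error plus weighted binary cross-entropy.

    mel_true, mel_pred: equally shaped nested lists [n_mels][n_frames].
    label: 1 for dysarthric speech, 0 otherwise.
    prob: predicted probability of dysarthric speech.
    weight: hypothetical trade-off between the two tasks (an assumption).
    """
    # Mean squared error over all mel bins and frames (assumed loss form).
    squared = [
        (t - p) ** 2
        for row_true, row_pred in zip(mel_true, mel_pred)
        for t, p in zip(row_true, row_pred)
    ]
    reconstruction = sum(squared) / len(squared)

    # Binary cross-entropy, with the probability clipped to avoid log(0).
    prob = min(max(prob, eps), 1.0 - eps)
    detection = -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

    return reconstruction + weight * detection
```

Setting `weight=0.0` reduces the objective to plain reconstruction, which is one way to check how much the detection term contributes.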

What carries the argument

The encoder-decoder that factorizes input speech into a low-dimensional latent space alongside text encoding.

Load-bearing premise

The argument assumes that the low-dimensional latent space learned by the auto-encoder actually conveys interpretable characteristics of dysarthria such as intelligibility and fluency, and that adapting this space produces measurably improved fluency.

What would settle it

A direct comparison showing that detection accuracy does not rise when the model predicts from the low-dimensional latent space versus from the raw mel-spectrogram, or that MUSHRA fluency scores do not increase after latent-space adaptation.
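One way to run such a direct comparison, assuming both classifiers are scored on the same held-out utterances, is a paired test on the items where they disagree. The sketch below uses an exact two-sided McNemar test. The choice of test and the argument names `pred_a` and `pred_b` (for example, a latent-space classifier and a mel-spectrogram classifier) are illustrative assumptions, not something the paper is known to report.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (only_a, only_b, p_value), where only_a counts items that only
    classifier A labels correctly and only_b counts items that only
    classifier B labels correctly.
    """
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        # The classifiers never disagree on correctness: no evidence either way.
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Binomial tail under the null that each disagreement is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)
```

A small p-value with `only_a` larger than `only_b` would support the claim that classifier A is more accurate; a large p-value would leave the attribution to the latent space unsettled.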


If this is right

Detection accuracy increases when the model jointly predicts dysarthria probability and the mel-spectrogram from the latent space.
The latent variables can be adjusted to raise the fluency rating of reconstructed speech in listening tests.
Intelligibility and fluency become directly readable from coordinates in the learned latent space.
Reconstruction quality improves because the model separates dysarthria traits from linguistic content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorization could be tested on other motor speech disorders to see whether the latent dimensions remain clinically meaningful.
If the latent space generalizes across speakers, it might support speaker-independent adaptation for assistive devices.
A follow-up experiment could measure whether the same latent adjustments also change word-error rates in automatic speech recognition of the output.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper introduces a text-conditioned autoencoder with multi-task dysarthria detection on the latent space, but the key accuracy gain is attributed to the low-dimensional representation without showing a direct high-dimensional baseline.


The core idea here is a joint model that encodes speech into a low-dimensional latent space plus text, then does multi-task prediction of both mel-spectrogram and dysarthria probability. That factorization and the multi-task setup on the latent variables look like the actual new piece. The abstract also reports that the latent space picks up interpretable traits like intelligibility and fluency, and a MUSHRA test shows adaptation improves perceived fluency. Those are concrete steps worth noting for assistive speech work. The main soft spot is the causal claim that detection accuracy improves specifically because of the low-dimensional latent space rather than direct prediction on mel-spectrograms. No high-dimensional baseline is described, so it is not clear whether the lift comes from the compression, the multi-task supervision, or the text conditioning. The abstract also gives no numbers, dataset sizes, or statistical tests, which makes the strength of the results hard to judge from what is shown. If the full paper supplies those controlled comparisons and the raw scores, the work becomes more solid. This is aimed at people working on clinical speech processing and dysarthria tools rather than a broad ML audience. It is worth sending to peer review so referees can check the experimental controls and see whether the latent-space advantage holds up under direct comparison.

Referee Report

3 major / 1 minor

Summary. The paper proposes a multi-task encoder-decoder model that factorizes dysarthric speech into a low-dimensional latent space plus text encoding. It claims the latent space is interpretable with respect to intelligibility and fluency, that latent-space adaptation yields improved fluency per MUSHRA testing, and that the multi-task setup produces higher dysarthria detection accuracy specifically because it operates in the compressed latent space rather than directly on high-dimensional mel-spectrograms.

Significance. If the central claims were substantiated with controlled baselines, quantitative metrics, and statistical validation, the work would provide a concrete example of an interpretable latent representation tied to perceptual speech attributes and a practical multi-task architecture for simultaneous detection and reconstruction; such a result would be of interest to clinical speech technology.

major comments (3)

[Abstract / strongest claim] The load-bearing claim that detection accuracy improves 'thanks to a low-dimensional latent space of the auto-encoder as opposed to directly predicting dysarthria from a highly dimensional mel-spectrogram' is not supported by any referenced baseline experiment that trains a dysarthria classifier directly on mel-spectrograms; without this controlled comparison the causal attribution cannot be verified.
[Abstract] No quantitative results (accuracy values, dataset sizes, statistical tests, or error analysis) are supplied for either the detection task or the MUSHRA perceptual test, so it is impossible to determine whether the data actually support the stated improvements.
[Abstract / weakest assumption] The assertion that the latent space 'conveys interpretable characteristics of dysarthria, such as intelligibility and fluency' is stated without any described method, visualization, or correlation analysis linking specific latent dimensions to those perceptual attributes.
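The correlation analysis requested in the third comment could take the following form: for each latent dimension, compute the Pearson correlation between that coordinate and a per-utterance perceptual rating. This is a hedged sketch; the function names and the data layout (one latent vector and one rating per utterance) are assumptions, and the page does not say which analysis the paper actually uses.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant sequence has no defined correlation; report 0.0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlate_latents(latents, ratings):
    """Correlate every latent dimension with a perceptual rating.

    latents: list of per-utterance latent vectors (hypothetical layout).
    ratings: one rating per utterance, e.g. an intelligibility score.
    Returns one correlation coefficient per latent dimension.
    """
    n_dims = len(latents[0])
    return [pearson([z[d] for z in latents], ratings) for d in range(n_dims)]
```

A dimension whose coefficient is close to 1 or -1 for fluency ratings, and near 0 for unrelated attributes, would be the kind of evidence the referee asks for.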

minor comments (1)

[Abstract] The abstract contains a tense inconsistency ('This paper proposed').

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed and constructive referee report. We appreciate the feedback highlighting areas where the abstract and manuscript require clarification and additional support for the claims. We address each major comment below and indicate the revisions we will make.


Referee: The load-bearing claim that detection accuracy improves 'thanks to a low-dimensional latent space of the auto-encoder as opposed to directly predicting dysarthria from a highly dimensional mel-spectrogram' is not supported by any referenced baseline experiment that trains a dysarthria classifier directly on mel-spectrograms; without this controlled comparison the causal attribution cannot be verified.

Authors: We agree that the abstract phrasing attributes the improvement specifically to the latent space without an explicit controlled baseline on raw mel-spectrograms. The manuscript compares the multi-task model against several alternatives, but does not isolate this exact direct mel-spectrogram classifier. In the revision we will add this baseline experiment (or clearly reference it if present in supplementary material) together with accuracy numbers and statistical tests to substantiate or qualify the claim.

revision: yes

Referee: No quantitative results (accuracy values, dataset sizes, statistical tests, or error analysis) are supplied for either the detection task or the MUSHRA perceptual test, so it is impossible to determine whether the data actually support the stated improvements.

Authors: The abstract is written as a concise summary and therefore omits specific numbers. The body of the manuscript reports dataset sizes, detection accuracies, MUSHRA scores, and some statistical comparisons. To address the concern we will revise the abstract to include the key quantitative results (e.g., accuracy figures and MUSHRA means) and ensure all claims are explicitly tied to the reported statistics and tests.

revision: yes

Referee: The assertion that the latent space 'conveys interpretable characteristics of dysarthria, such as intelligibility and fluency' is stated without any described method, visualization, or correlation analysis linking specific latent dimensions to those perceptual attributes.

Authors: We acknowledge that the abstract states the interpretability claim without describing the supporting analysis. The full manuscript contains visualizations and correlation analyses linking latent dimensions to intelligibility and fluency scores. In the revision we will add a brief description of the method (e.g., correlation with perceptual ratings or dimension-wise analysis) to the abstract so the claim is properly grounded.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external tests and baselines, not self-referential definitions or fitted inputs


The paper describes a multi-task encoder-decoder model whose central claims (improved dysarthria detection accuracy via low-dimensional latent space, interpretable characteristics of dysarthria, and fluency gains via MUSHRA) are presented as outcomes of supervised training and human perceptual evaluation. No equations, derivations, or parameter-fitting steps are described that would reduce these outcomes to quantities defined by the model's own fitted values. The attribution to the latent space is an empirical hypothesis tested against data rather than a self-definitional or self-citation load-bearing reduction. The absence of a direct high-dimensional baseline is a methodological gap but does not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on the assumption that a standard neural autoencoder can be trained to produce a low-dimensional latent space whose dimensions align with human-interpretable speech properties, plus the usual deep-learning premise that mel-spectrograms are a sufficient representation for this task.

free parameters (1)

latent space dimension
The paper selects a low-dimensional size to enable both interpretability and improved detection; its exact value is a modeling choice.

axioms (2)

domain assumption: Mel-spectrograms contain the acoustic features needed to distinguish dysarthric from typical speech.
The model operates on mel-spectrograms as both input and reconstruction target.
domain assumption: The encoder-decoder can be trained to factor speech into an independent text encoding and a separate dysarthria-related latent code.
This factorization is the core architectural premise stated in the abstract.

pith-pipeline@v0.9.0 · 5675 in / 1375 out tokens · 25659 ms · 2026-05-24T23:21:26.793048+00:00 · methodology


Original abstract

This paper proposed a novel approach for the detection and reconstruction of dysarthric speech. The encoder-decoder model factorizes speech into a low-dimensional latent space and encoding of the input text. We showed that the latent space conveys interpretable characteristics of dysarthria, such as intelligibility and fluency of speech. MUSHRA perceptual test demonstrated that the adaptation of the latent space let the model generate speech of improved fluency. The multi-task supervised approach for predicting both the probability of dysarthric speech and the mel-spectrogram helps improve the detection of dysarthria with higher accuracy. This is thanks to a low-dimensional latent space of the auto-encoder as opposed to directly predicting dysarthria from a highly dimensional mel-spectrogram.

Figures

Figures reproduced from arXiv: 1907.04743 by Bozena Kostek, Daniel Korzekwa, Mateusz Lajszczak, Roberto Barra-Chicote, Thomas Drugman.

**Figure 1.** Architecture of deep learning model for detection and reconstruction of dysarthric speech. Let us define a matrix $X : [n_{mels}, n_f]$ representing a mel-spectrogram (frame length = 50 ms and frame shift = 12.5 ms), where $n_{mels} = 128$ is the number of mel-frequency bands and $n_f$ is the number of frames. Let us define a matrix $T : [n_c, n_t]$ representing a one-hot encoded input text, where $n_c$ is the number of unique cha…
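To make the shape of the mel-spectrogram matrix concrete, the frame length and frame shift quoted in the caption fix the number of frames for an utterance of a given duration. The sketch below is plain arithmetic under one assumption that the caption does not state: no padding is applied, so only frames lying fully inside the signal are counted. The function name `mel_shape` is hypothetical.

```python
def mel_shape(duration_ms, n_mels=128, frame_length_ms=50.0, frame_shift_ms=12.5):
    """Shape [n_mels, n_frames] of a mel-spectrogram for one utterance.

    Assumes no padding: a frame is counted only if it fits entirely
    inside the signal. Other framing conventions give slightly
    different frame counts.
    """
    if duration_ms < frame_length_ms:
        # The signal is shorter than a single analysis frame.
        return (n_mels, 0)
    n_frames = 1 + int((duration_ms - frame_length_ms) // frame_shift_ms)
    return (n_mels, n_frames)
```

Under this convention a one-second utterance yields a 128 by 77 matrix, which illustrates how high-dimensional the input is compared with a compact latent code.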

**Figure 4.** MUSHRA results for the fluency of speech for 5 reconstructions and one recorded speech. Rank order (left) and the median score on the scale from 0 to 100 (right).
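The median panel of such a figure can be reproduced from raw listener ratings with a few lines of code. The sketch below is illustrative only: the function name and the data layout (system name mapped to a list of 0 to 100 ratings) are assumptions, and the ordering here is by median score, which may differ from how the paper computes its rank-order panel.

```python
from statistics import median


def mushra_summary(scores):
    """Summarise MUSHRA ratings per system.

    scores: dict mapping a system name to its listener ratings (0-100).
    Returns (name, median) pairs sorted from highest to lowest median.
    """
    medians = {name: median(values) for name, values in scores.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)
```

Called with placeholder ratings for a recorded reference and two reconstructions, the function returns the systems ordered by median, which is the summary a reader needs in order to judge the size of the fluency gain.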

**Figure 3.** Supervised learning. As in …


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1]

Introduction Dysarthria is a motor speech disorder manifesting itself by a weakness of muscles controlled by the brain and nervous sys- tem that are used in the process of speech production, such as lips, jaw and throat [1]. Patients with dysarthria produce harsh and breathy speech with abnormal prosodic patterns, such as very low speech rate or ﬂat inton...

work page 2015
[2]

Interpretable Deep Learning Model for the Detection and Reconstruction of Dysarthric Speech

Related work 2.1. Dysarthria detection Deep neural networks can automatically detect dysarthric pat- terns without any prior expert knowledge [7, 8]. Unfortunately, these models are difﬁcult to interpret because they are usually composed of multiple layers producing multidimensional out- puts with an arbitrary meaning and representation. Contrar- ily, sta...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[3]

The audio and text en- coders produce a low-dimensional dysarthric latent space and a sequential encoding of the input text

Proposed model The model consists of two output networks, jointly trained, with a shared encoder as shown in Figure 1. The audio and text en- coders produce a low-dimensional dysarthric latent space and a sequential encoding of the input text. The audio decoder recon- structs input mel-spectrogram from a dysarthric latent space and encoded text. Logistic ...

work page
[4]

Dysarthric speech database There is no well-established benchmark in the literature to com- pare different models for detecting dysarthria

Experiments 4.1. Dysarthric speech database There is no well-established benchmark in the literature to com- pare different models for detecting dysarthria. Aside from the most popular dysarthric corpora, UA-Speech [31] and TORGO [32], there are multiple speech databases created for the pur- pose of a speciﬁc study, for example, corpora of 57 dysarthric s...

work page
[5]

The encoder-decoder model factorizes speech into a low-dimensional latent space and en- coding of the input text

Conclusions This paper proposed a novel approach for the detection and re- construction of dysarthric speech. The encoder-decoder model factorizes speech into a low-dimensional latent space and en- coding of the input text. We showed that the latent space con- veys interpretable characteristics of dysarthria, such as intelligi- bility and ﬂuency of speech...

work page
[6]

Nadolski, J

Acknowledgements We would like to thank A. Nadolski, J. Droppo, J. Rohnke and V . Klimkov for insightful discussions on this work

work page
[7]

The American Speech-Language-Hearing Association (ASHA) - Dysarthria,

ASHA, “The American Speech-Language-Hearing Association (ASHA) - Dysarthria,” 2018

work page 2018
[8]

Neuropsychologi- cal improvement after posterior fossa arachnoid cyst drainage,

M. L. Cuny, M. Pallone, H. Piana, N. Boddaert, C. Sainte-Rose, L. Vaivre-Douret, P. Piolino, and S. Puget, “Neuropsychologi- cal improvement after posterior fossa arachnoid cyst drainage,” Child’s Nervous System, 2017

work page 2017
[9]

Communication Difﬁculties as a Result of Dementia,

S. Banovic, L. Zunic, and O. Sinanovic, “Communication Difﬁculties as a Result of Dementia,” Materia Socio Medica , vol. 30, no. 2, p. 221, 2018. [Online]. Available: https: //www.ejmanager.com/fulltextpdf.php?mno=302643414

work page 2018
[10]

One in three people born in 2015 will de- velop dementia, new analysis shows,

Alzheimersresearchuk, “One in three people born in 2015 will de- velop dementia, new analysis shows,” 2015

work page 2015
[11]

Speech synthesis technologies for individuals with vocal disabilities: V oice banking and reconstruction,

J. Yamagishi, C. Veaux, S. King, and S. Renals, “Speech synthesis technologies for individuals with vocal disabilities: V oice banking and reconstruction,” Acoustical Science and Technology, vol. 33, no. 1, pp. 1–5, 2012

work page 2012
[12]

Combining neural network and rule-based systems for dysarthria diagnosis,

J. Carmichael, V . Wan, and P. Green, “Combining neural network and rule-based systems for dysarthria diagnosis,” in Proceedings of the Annual Conference of the International Speech Communi- cation Association, INTERSPEECH, 2008

work page 2008
[13]

Excitation Source Analysis of Dysarthric Speech for Early Stage Detection of Dysarthria,

G. Krishna, “Excitation Source Analysis of Dysarthric Speech for Early Stage Detection of Dysarthria,” WSPD, 2018

work page 2018
[14]

A Multitask Learning Approach to Assess the Dysarthria Severity in Patients with Parkinson’s Disease,

J. C. V ´asquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, and E. N ¨oth, “A Multitask Learning Approach to Assess the Dysarthria Severity in Patients with Parkinson’s Disease,” in In- terspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018., B. Yegnanarayana, Ed. ISCA, 2018,...

work page 2018
[15]

Characterization of atyp- ical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility,

T. H. Falk, W. Y . Chan, and F. Shein, “Characterization of atyp- ical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility,” Speech Communication, 2012

work page 2012
[16]

Automated Dysarthria Severity Clas- siﬁcation for Improved Objective Intelligibility Assessment of Spastic Dysarthric Speech

M. Sarria-Paja and T. Falk, “Automated Dysarthria Severity Clas- siﬁcation for Improved Objective Intelligibility Assessment of Spastic Dysarthric Speech.” in Interspeech, 2012

work page 2012
[17]

Cross-database models for the classiﬁcation of dysarthria presence,

S. Gillespie, Y . Y . Logan, E. Moore, J. Laures-Gore, S. Rus- sell, and R. Patel, “Cross-database models for the classiﬁcation of dysarthria presence,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTER- SPEECH, 2017

work page 2017
[18]

V owel Acoustics in Dysarthria: Speech Disorder Diagnosis and Classiﬁcation,

K. L. Lansford and J. M. Liss, “V owel Acoustics in Dysarthria: Speech Disorder Diagnosis and Classiﬁcation,”Journal of Speech Language and Hearing Research, 2014

work page 2014
[19]

Interpretable Objective Assess- ment of Dysarthric Speech Based on Deep Neural Networks,

M. Tu, V . Berisha, and J. Liss, “Interpretable Objective Assess- ment of Dysarthric Speech Based on Deep Neural Networks,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, F. Lacerda, Ed. ISCA, 2017, pp. 1849–1853

work page 2017
[20]

www.modeltalker.com

Modeltalker, “www.modeltalker.com.”

work page
[21]

Effect of data reduction on sequence-to-sequence neural {TTS},

J. Latorre, J. Lachowicz, J. Lorenzo-Trueba, T. Merritt, T. Drug- man, S. Ronanki, and K. Viacheslav, “Effect of data reduction on sequence-to-sequence neural {TTS},” CoRR, vol. abs/1811.0, 2018

work page 2018
[22]

Re- constructing the voice of an individual following laryngectomy,

Z. Ahmad Khan, P. Green, S. Creer, and S. Cunningham, “Re- constructing the voice of an individual following laryngectomy,” 2011

work page 2011
[23]

Rabiner and R

L. Rabiner and R. Schafer, Digital Processing of Speech Signals. Englewood Cliffs: Prentice Hall, 1978

work page 1978
[24]

Glot- tal source processing: from analysis to applications,

T. Drugman, P. Alku, A. Alwan, and B. Yegnanarayana, “Glot- tal source processing: from analysis to applications,” Computer Speech and Language, vol. 28, 09 2014

work page 2014
[25]

Tutorial on Variational Autoencoders,

C. Doersch, “Tutorial on Variational Autoencoders,” 2016

work page 2016
[26]

Con- trollable Text Generation,

Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing, “Con- trollable Text Generation,”CoRR, vol. abs/1703.0, 2017

work page 2017
[27]

Generating Sentences from a Continuous Space,

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. J ´ozefowicz, and S. Bengio, “Generating Sentences from a Continuous Space,” CoRR, vol. abs/1511.0, 2015

work page 2015
[28]

Learning latent rep- resentations for style control and transfer in end-to-end speech synthesis,

Y .-J. Zhang, S. Pan, L. He, and Z.-H. Ling, “Learning latent rep- resentations for style control and transfer in end-to-end speech synthesis,” CoRR, vol. abs/1812.0, 2018

work page 2018
[29]

Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data,

W.-N. Hsu, Y . Zhang, and J. R. Glass, “Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data,” CoRR, vol. abs/1709.0, 2017

work page 2017
[30]

Understanding the difﬁculty of train- ing deep feedforward neural networks

X. Glorot and Y . Bengio, “Understanding the difﬁculty of train- ing deep feedforward neural networks.” in AISTATS, ser. JMLR Proceedings, Y . W. Teh and D. M. Titterington, Eds., vol. 9. JMLR.org, 2010, pp. 249–256

work page 2010
[31]

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

T. Chen, M. Li, Y . Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A ﬂexible and efﬁcient machine learning library for heterogeneous distributed systems,” CoRR, vol. abs/1512.01274, 2015. [Online]. Available: http://arxiv.org/abs/1512.01274

work page internal anchor Pith review Pith/arXiv arXiv 2015
[32]

Towards End- to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,

R. J. Skerry-Ryan, E. Battenberg, Y . Xiao, Y . Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards End- to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,”CoRR, vol. abs/1803.0, 2018

work page 2018
[33]

Learning Phrase Representations using {RNN} Encoder-Decoder for Statistical Machine Transla- tion,

K. Cho, B. van Merrienboer, C ¸ . G ¨ulc ¸ehre, F. Bougares, H. Schwenk, and Y . Bengio, “Learning Phrase Representations using {RNN} Encoder-Decoder for Statistical Machine Transla- tion,” CoRR, vol. abs/1406.1, 2014

work page 2014
[34]

Dropout: A Simple Way to Prevent Neural Networks from Overﬁtting,

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overﬁtting,” Journal of Machine Learning Re- search, vol. 15, pp. 1929–1958, 2014

work page 1929
[35]

Tacotron: {A} Fully End-to-End Text-To-Speech Synthesis Model,

Y . Wang, R. J. Skerry-Ryan, D. Stanton, Y . Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y . Xiao, Z. Chen, S. Bengio, Q. V . Le, Y . Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: {A} Fully End-to-End Text-To-Speech Synthesis Model,”CoRR, vol. abs/1703.1, 2017

work page 2017
[36]

Attention Is All You Need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” CoRR, vol. abs/1706.0, 2017

work page 2017
[37]

Dysarthric Speech Database for Universal Access Research,

H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. Huang, K. Watkin, and S. Frame, “Dysarthric Speech Database for Universal Access Research,” INTERSPEECH, 2008

work page 2008
[38]

The TORGO database of acoustic and articulatory speech from speakers with dysarthria,

F. Rudzicz, A. K. Namasivayam, and T. Wolff, “The TORGO database of acoustic and articulatory speech from speakers with dysarthria,” Language Resources and Evaluation, 2012

work page 2012
[39]

A framework for collecting realistic recordings of dysarthric speech - The homeService corpus,

M. Nicolao, H. Christensen, S. Cunningham, P. Green, and T. Hain, “A framework for collecting realistic recordings of dysarthric speech - The homeService corpus,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, 2016

work page 2016
[40]

A. B. Johnston and D. C. Burnett, WebRTC: APIs and RTCWEB Protocols of the HTML5 Real-Time Web . USA: Digital Codex LLC, 2012

work page 2012
[41]

Dysarthric Speech Classiﬁcation Us- ing Glottal Features Computed from Non-words, Words and Sen- tences,

N. P. Narendra and P. Alku, “Dysarthric Speech Classiﬁcation Us- ing Glottal Features Computed from Non-words, Words and Sen- tences,” inInterspeech 2018, 19th Annual Conference of the Inter- national Speech Communication Association, Hyderabad, India, 2-6 September 2018. , B. Yegnanarayana, Ed. ISCA, 2018, pp. 3403–3407

work page 2018
[42]

Signal Estimation from Modiﬁed Short-Time Fourier Transform,

D. W. Grifﬁn and J. S. Lim, “Signal Estimation from Modiﬁed Short-Time Fourier Transform,”IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984

work page 1984
[43]

Comprehensive evaluation of statistical speech waveform synthesis,

T. Merritt, B. Putrycz, A. Nadolski, T. Ye, D. Korzekwa, W. Dolecki, T. Drugman, V . Klimkov, A. Moinet, A. Breen, R. Kuklinski, N. Strom, and R. Barra-Chicote, “Comprehensive evaluation of statistical speech waveform synthesis,” nov 2018

work page 2018
[44]

Disentan- gling Disentanglement in Variational Auto-Encoders,

E. Mathieu, T. Rainforth, N. Siddharth, and Y . W. Teh, “Disentan- gling Disentanglement in Variational Auto-Encoders,” dec 2018

work page 2018

[1] [1]

Introduction Dysarthria is a motor speech disorder manifesting itself by a weakness of muscles controlled by the brain and nervous sys- tem that are used in the process of speech production, such as lips, jaw and throat [1]. Patients with dysarthria produce harsh and breathy speech with abnormal prosodic patterns, such as very low speech rate or ﬂat inton...

work page 2015

[2] [2]

Interpretable Deep Learning Model for the Detection and Reconstruction of Dysarthric Speech

Related work 2.1. Dysarthria detection Deep neural networks can automatically detect dysarthric pat- terns without any prior expert knowledge [7, 8]. Unfortunately, these models are difﬁcult to interpret because they are usually composed of multiple layers producing multidimensional out- puts with an arbitrary meaning and representation. Contrar- ily, sta...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

The audio and text en- coders produce a low-dimensional dysarthric latent space and a sequential encoding of the input text

Proposed model The model consists of two output networks, jointly trained, with a shared encoder as shown in Figure 1. The audio and text en- coders produce a low-dimensional dysarthric latent space and a sequential encoding of the input text. The audio decoder recon- structs input mel-spectrogram from a dysarthric latent space and encoded text. Logistic ...

work page

[4] [4]

Dysarthric speech database There is no well-established benchmark in the literature to com- pare different models for detecting dysarthria

Experiments 4.1. Dysarthric speech database There is no well-established benchmark in the literature to com- pare different models for detecting dysarthria. Aside from the most popular dysarthric corpora, UA-Speech [31] and TORGO [32], there are multiple speech databases created for the pur- pose of a speciﬁc study, for example, corpora of 57 dysarthric s...

work page

[5] [5]

The encoder-decoder model factorizes speech into a low-dimensional latent space and en- coding of the input text

Conclusions This paper proposed a novel approach for the detection and re- construction of dysarthric speech. The encoder-decoder model factorizes speech into a low-dimensional latent space and en- coding of the input text. We showed that the latent space con- veys interpretable characteristics of dysarthria, such as intelligi- bility and ﬂuency of speech...

work page

[6] [6]

Nadolski, J

Acknowledgements We would like to thank A. Nadolski, J. Droppo, J. Rohnke and V . Klimkov for insightful discussions on this work

work page

[7] [7]

The American Speech-Language-Hearing Association (ASHA) - Dysarthria,

ASHA, “The American Speech-Language-Hearing Association (ASHA) - Dysarthria,” 2018

work page 2018

[8] [8]

Neuropsychologi- cal improvement after posterior fossa arachnoid cyst drainage,

M. L. Cuny, M. Pallone, H. Piana, N. Boddaert, C. Sainte-Rose, L. Vaivre-Douret, P. Piolino, and S. Puget, “Neuropsychologi- cal improvement after posterior fossa arachnoid cyst drainage,” Child’s Nervous System, 2017

work page 2017

[9] S. Banovic, L. Zunic, and O. Sinanovic, “Communication Difficulties as a Result of Dementia,” Materia Socio Medica, vol. 30, no. 2, p. 221, 2018. [Online]. Available: https://www.ejmanager.com/fulltextpdf.php?mno=302643414

[10] Alzheimersresearchuk, “One in three people born in 2015 will develop dementia, new analysis shows,” 2015.

[11] J. Yamagishi, C. Veaux, S. King, and S. Renals, “Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction,” Acoustical Science and Technology, vol. 33, no. 1, pp. 1–5, 2012.

[12] J. Carmichael, V. Wan, and P. Green, “Combining neural network and rule-based systems for dysarthria diagnosis,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2008.

[13] G. Krishna, “Excitation Source Analysis of Dysarthric Speech for Early Stage Detection of Dysarthria,” WSPD, 2018.

[14] J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, and E. Nöth, “A Multitask Learning Approach to Assess the Dysarthria Severity in Patients with Parkinson’s Disease,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, B. Yegnanarayana, Ed. ISCA, 2018,...

[15] T. H. Falk, W. Y. Chan, and F. Shein, “Characterization of atypical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility,” Speech Communication, 2012.

[16] M. Sarria-Paja and T. Falk, “Automated Dysarthria Severity Classification for Improved Objective Intelligibility Assessment of Spastic Dysarthric Speech,” in Interspeech, 2012.

[17] S. Gillespie, Y. Y. Logan, E. Moore, J. Laures-Gore, S. Russell, and R. Patel, “Cross-database models for the classification of dysarthria presence,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017.

[18] K. L. Lansford and J. M. Liss, “Vowel Acoustics in Dysarthria: Speech Disorder Diagnosis and Classification,” Journal of Speech Language and Hearing Research, 2014.

[19] M. Tu, V. Berisha, and J. Liss, “Interpretable Objective Assessment of Dysarthric Speech Based on Deep Neural Networks,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, F. Lacerda, Ed. ISCA, 2017, pp. 1849–1853.

[20] Modeltalker, “www.modeltalker.com.”

[21] J. Latorre, J. Lachowicz, J. Lorenzo-Trueba, T. Merritt, T. Drugman, S. Ronanki, and K. Viacheslav, “Effect of data reduction on sequence-to-sequence neural TTS,” CoRR, vol. abs/1811.0, 2018.

[22] Z. Ahmad Khan, P. Green, S. Creer, and S. Cunningham, “Reconstructing the voice of an individual following laryngectomy,” 2011.

[23] L. Rabiner and R. Schafer, Digital Processing of Speech Signals. Englewood Cliffs: Prentice Hall, 1978.

[24] T. Drugman, P. Alku, A. Alwan, and B. Yegnanarayana, “Glottal source processing: from analysis to applications,” Computer Speech and Language, vol. 28, Sep. 2014.

[25] C. Doersch, “Tutorial on Variational Autoencoders,” 2016.

[26] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing, “Controllable Text Generation,” CoRR, vol. abs/1703.0, 2017.

[27] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio, “Generating Sentences from a Continuous Space,” CoRR, vol. abs/1511.0, 2015.

[28] Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling, “Learning latent representations for style control and transfer in end-to-end speech synthesis,” CoRR, vol. abs/1812.0, 2018.

[29] W.-N. Hsu, Y. Zhang, and J. R. Glass, “Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data,” CoRR, vol. abs/1709.0, 2017.

[30] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, ser. JMLR Proceedings, Y. W. Teh and D. M. Titterington, Eds., vol. 9. JMLR.org, 2010, pp. 249–256.

[31] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” CoRR, vol. abs/1512.01274, 2015. [Online]. Available: http://arxiv.org/abs/1512.01274

[32] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” CoRR, vol. abs/1803.0, 2018.

[33] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” CoRR, vol. abs/1406.1, 2014.

[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.

[35] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model,” CoRR, vol. abs/1703.1, 2017.

[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” CoRR, vol. abs/1706.0, 2017.

[37] H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. Huang, K. Watkin, and S. Frame, “Dysarthric Speech Database for Universal Access Research,” INTERSPEECH, 2008.

[38] F. Rudzicz, A. K. Namasivayam, and T. Wolff, “The TORGO database of acoustic and articulatory speech from speakers with dysarthria,” Language Resources and Evaluation, 2012.

[39] M. Nicolao, H. Christensen, S. Cunningham, P. Green, and T. Hain, “A framework for collecting realistic recordings of dysarthric speech - The homeService corpus,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, 2016.

[40] A. B. Johnston and D. C. Burnett, WebRTC: APIs and RTCWEB Protocols of the HTML5 Real-Time Web. USA: Digital Codex LLC, 2012.

[41] N. P. Narendra and P. Alku, “Dysarthric Speech Classification Using Glottal Features Computed from Non-words, Words and Sentences,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, B. Yegnanarayana, Ed. ISCA, 2018, pp. 3403–3407.

[42] D. W. Griffin and J. S. Lim, “Signal Estimation from Modified Short-Time Fourier Transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.

[43] T. Merritt, B. Putrycz, A. Nadolski, T. Ye, D. Korzekwa, W. Dolecki, T. Drugman, V. Klimkov, A. Moinet, A. Breen, R. Kuklinski, N. Strom, and R. Barra-Chicote, “Comprehensive evaluation of statistical speech waveform synthesis,” Nov. 2018.

[44] E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh, “Disentangling Disentanglement in Variational Auto-Encoders,” Dec. 2018.