Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement

Alzahra Badi; David K. Han; Hanseok Ko; Sangwook Park

arXiv: 1907.11361 · v1 · pith:7ROQNYMLnew · submitted 2019-07-26 · 📡 eess.AS · cs.LG · cs.SD

Pith reviewed 2026-05-24 15:36 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD

keywords denoising autoencoder · speech feature enhancement · noise robust ASR · correlation distance · skip connections · word error rate · end-to-end speech recognition

The pith

A denoising autoencoder with skip connections and a correlation distance penalty reduces average word error rates for noisy speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CDSK-DAE, a variant of denoising autoencoder designed to enhance speech features for noise-robust end-to-end automatic speech recognition. Skip connections pass target frame information from input through both encoder and decoder, while a new objective function incorporates correlation distance to penalize mismatches in dependency between latent target features and the model's latent and enhanced outputs. Tests on seven background noise types at SNRs of 0, 5, 10, and 20 dB show lower average word error rates than both a conventional model and a state-of-the-art model for both seen and unseen noises, with gains holding for linear and non-linear penalty terms.

Core claim

The CDSK-DAE uses skip connections on both the encoder and decoder sides to pass speech information of the target frame. It pairs them with an objective function whose penalty terms apply correlation distance to measure the dependency between latent target features and the DAE outputs (latent features and enhanced features). The paper reports lower overall average word error rates under noisy conditions, in both seen and unseen environments, than the state-of-the-art model it compares against.

What carries the argument

Skip connections passing target-frame speech information combined with correlation distance measures in the objective function's penalty terms to capture dependency between latent target features and model outputs.
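As a rough illustration of that data flow, the sketch below runs one forward pass of a tiny fully connected autoencoder in which the target (centre) frame of the input window is fed again to the second encoder layer and to the output layer. The summary does not say how the skips are wired, so the concatenation, the layer sizes, the tanh activation, and the names `make_params` and `forward` are all assumptions made for illustration, not the paper's architecture.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, activate: bool = True) -> Vector:
    """Bias-free dense layer; `weights` holds one row per output unit."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [math.tanh(o) for o in out] if activate else out


def make_params(frame_dim: int, context: int, hidden: int, latent: int,
                seed: int = 0) -> Dict[str, Matrix]:
    """Random weights for a hypothetical 2-layer encoder and 2-layer decoder."""
    rng = random.Random(seed)

    def layer(n_in: int, n_out: int) -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                for _ in range(n_out)]

    return {
        "enc1": layer(frame_dim * context, hidden),
        "enc2": layer(hidden + frame_dim, latent),     # hidden + target frame
        "dec1": layer(latent, hidden),
        "dec2": layer(hidden + frame_dim, frame_dim),  # decoded + target frame
    }


def forward(window: List[Vector],
            params: Dict[str, Matrix]) -> Tuple[Vector, Vector]:
    """Return (latent features, enhanced target frame) for one input window."""
    target = window[len(window) // 2]              # centre frame of the window
    flat = [v for frame in window for v in frame]  # stacked context frames
    hidden = dense(flat, params["enc1"])
    # Encoder-side skip: target frame concatenated to the hidden activations
    # (concatenation is an assumption of this sketch).
    latent = dense(hidden + target, params["enc2"])
    decoded = dense(latent, params["dec1"])
    # Decoder-side skip: target frame concatenated again before the output.
    enhanced = dense(decoded + target, params["dec2"], activate=False)
    return latent, enhanced
```

With this wiring the noisy target frame reaches the output layer directly, so the deeper layers only need to model the correction to it.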

If this is right

The method yields lower average word error rates than the state-of-the-art model on both seen and unseen noise conditions.
Improvements hold when using either linear or non-linear penalty terms in the objective function.
The approach applies directly to feature enhancement in end-to-end ASR systems exposed to noise absent from training data.
Performance gains appear across seven noise types at four different SNR levels (0, 5, 10, 20 dB).
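For readers unfamiliar with how such test conditions are usually built, the following is a minimal sketch of mixing a noise signal into clean speech at a chosen SNR. It uses plain mean power over the whole signal; the paper's reference list includes ITU-T P.56 (active speech level), so its actual level measurement may differ. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Average squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Scale `noise` so that speech power / noise power matches `snr_db`."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR/10)
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech: Sequence[float], noisy: Sequence[float]) -> float:
    """SNR of a mixture, given the clean speech it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(mean_power(speech) / mean_power(residual))
```

Calling `mix_at_snr(speech, noise, snr)` for each `snr` in `(0, 5, 10, 20)` would give one mixture per tested level for a given noise type.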

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The correlation distance penalty could be tested in autoencoders for other sequential data enhancement tasks such as music or environmental sound processing.
Combining skip connections with dependency measures might reduce the need for large amounts of paired clean-noisy training data in related enhancement problems.
Evaluating the model on real recorded noisy speech rather than added noise would check whether the gains transfer outside simulated conditions.

Load-bearing premise

The correlation distance measure in the penalty terms accurately captures the dependency between latent target features and the DAE's outputs.
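To make the premise concrete, here is a small sketch of the sample distance correlation of Székely, Rizzo and Bakirov, which appears in the paper's reference list and matches the "distance correlation matrix" wording of the figure captions. Treating it as the paper's measure is an assumption: the summary never gives the formula, and the normalization or any batching used in training may differ. The function names are illustrative.

```python
import math
from typing import List, Sequence

Sample = Sequence[float]


def _centered_distances(samples: Sequence[Sample]) -> List[List[float]]:
    """Pairwise Euclidean distances, double-centered by row and grand means."""
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    row_mean = [sum(row) / n for row in dist]  # equals column mean (symmetric)
    grand_mean = sum(row_mean) / n
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x: Sequence[Sample], y: Sequence[Sample]) -> float:
    """Sample distance correlation between paired observations x[i], y[i].

    The two sides may have different feature dimensions; only the number
    of observations has to match.
    """
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of samples")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)

    def mean_product(p: List[List[float]], q: List[List[float]]) -> float:
        return sum(p[i][j] * q[i][j]
                   for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)    # squared distance covariance
    dvar2_x = mean_product(a, a)  # squared distance variance of x
    dvar2_y = mean_product(b, b)  # squared distance variance of y
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0                # one side is constant: no dependency signal
    return math.sqrt(max(dcov2, 0.0) / denom)
```

Because the measure works on pairwise distances, it can compare latent features and enhanced features of different dimensionality, which is one plausible reason to prefer it over a coordinate-wise correlation.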

What would settle it

An experiment in which the CDSK-DAE produces equal or higher average word error rates than the state-of-the-art model across a new collection of unseen noise types at the tested SNR levels would falsify the reported improvement.
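The comparison described above comes down to computing WER per condition and averaging. A minimal sketch follows, assuming whitespace tokenization and an unweighted mean over conditions; the paper's scoring tool and averaging scheme are not given in this summary, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def word_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


def average_wer(per_condition: Dict[str, float]) -> float:
    """Unweighted mean of WER values keyed by condition label."""
    return sum(per_condition.values()) / len(per_condition)
```

Under this reading, the falsifying outcome is an `average_wer` for CDSK-DAE that is equal to or higher than the baseline's when both are scored on the same set of unseen-noise conditions.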

Figures

Figures reproduced from arXiv: 1907.11361 by Alzahra Badi, David K. Han, Hanseok Ko, Sangwook Park.

**Figure 2.** Distance correlation matrix during training be [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]

**Figure 3.** Distance correlation matrix during training be [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

read the original abstract

Performance of learning based Automatic Speech Recognition (ASR) is susceptible to noise, especially when it is introduced in the testing data while not presented in the training data. This work focuses on a feature enhancement for noise robust end-to-end ASR system by introducing a novel variant of denoising autoencoder (DAE). The proposed method uses skip connections in both encoder and decoder sides by passing speech information of the target frame from input to the model. It also uses a new objective function in training model that uses a correlation distance measure in penalty terms by measuring dependency of the latent target features and the model (latent features and enhanced features obtained from the DAE). Performance of the proposed method was compared against a conventional model and a state of the art model under both seen and unseen noisy environments of 7 different types of background noise with different SNR levels (0, 5, 10 and 20 dB). The proposed method also is tested using linear and non-linear penalty terms as well, where, they both show an improvement on the overall average WER under noisy conditions both seen and unseen in comparison to the state-of-the-art model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CDSK-DAE adds skip connections and a correlation-distance penalty to DAE but provides no ablation isolating the penalty's role in the reported WER gains.

The paper's main move is to modify a denoising autoencoder for speech feature enhancement by adding skip connections that pass target-frame information through both encoder and decoder, plus a new training objective that adds a correlation-distance penalty to capture dependency between latent target features and the model's outputs. They test linear and nonlinear versions of that penalty and report lower average WER than a conventional DAE and a state-of-the-art baseline across seven noise types at 0-20 dB SNR, including unseen conditions. That combination is new, and the focus on unseen noise plus the dependency-aware loss is a reasonable practical step for ASR robustness work. The experiments are set up sensibly for the claim they want to make. The clearest weakness is the missing ablation: nothing shows what happens when you keep the skip connections but drop the correlation term, so the abstract leaves open whether the penalty itself is responsible for any gain or whether the architectural skips alone would produce the same result. The summary also gives no quantitative deltas, confidence intervals, or per-condition breakdowns, which makes it hard to judge the size or reliability of the improvement. This is incremental work aimed at people already tuning autoencoders for noisy speech features. A reader who needs a controlled demonstration that the correlation penalty adds value beyond skips will not get it here. The paper is coherent on its own terms and engages the right literature, so it deserves peer review once the full version supplies the missing controls and numbers; without them it would be a borderline case.

Referee Report

1 major / 2 minor

Summary. The paper proposes the Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for speech feature enhancement to improve noise robustness in end-to-end ASR. The architecture adds skip connections on both encoder and decoder sides to pass target-frame speech information from the input. A novel objective function augments the reconstruction loss with correlation-distance penalty terms that measure dependency between latent target features and the DAE outputs (both latent and enhanced features). Linear and non-linear variants of the penalty are tested. Experiments compare the method against a conventional DAE and a state-of-the-art model on seven noise types at SNRs of 0, 5, 10 and 20 dB under both seen and unseen noise conditions, claiming lower average word error rate (WER) for the proposed approach.

Significance. If the reported WER gains are reproducible and attributable to the correlation-distance term rather than the skip connections alone, the work would supply a concrete, dependency-aware training objective that can be combined with skip-connection DAEs. This could modestly advance feature-enhancement pipelines for ASR in mismatched noise conditions.

major comments (1)

[Experiments / Results section (no equation or table number supplied for the objective)] The headline claim attributes WER reductions to the new objective function that uses correlation distance to capture dependency. However, the manuscript reports results for linear and non-linear penalty variants but does not describe an ablation that retains the skip connections and base reconstruction loss while removing only the correlation-distance penalty. Without this control experiment it is impossible to determine whether any observed improvement over the SOTA baseline is driven by the claimed dependency measure or simply by the architectural skip connections. This issue is load-bearing for the central attribution in the abstract and results.

minor comments (2)

[Abstract] The abstract asserts quantitative improvement but supplies no numerical WER values, confidence intervals, or per-noise-type tables; readers must reach the results section to evaluate effect size.
[Methods / Objective function] The description of the correlation-distance term (“measuring dependency of the latent target features and the model (latent features and enhanced features obtained from the DAE)”) is informal; the exact mathematical definition (e.g., which correlation coefficient or distance, how it is normalized, and its weighting relative to the reconstruction loss) should be stated explicitly with an equation in the methods section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The single major comment raises a valid point about experimental controls, which we address below.

Referee: [Experiments / Results section (no equation or table number supplied for the objective)] The headline claim attributes WER reductions to the new objective function that uses correlation distance to capture dependency. However, the manuscript reports results for linear and non-linear penalty variants but does not describe an ablation that retains the skip connections and base reconstruction loss while removing only the correlation-distance penalty. Without this control experiment it is impossible to determine whether any observed improvement over the SOTA baseline is driven by the claimed dependency measure or simply by the architectural skip connections. This issue is load-bearing for the central attribution in the abstract and results.

Authors: We agree that the current manuscript does not include an ablation retaining the skip connections and base reconstruction loss while removing only the correlation-distance penalty terms. This omission makes it difficult to isolate the contribution of the correlation-distance component from the architectural changes. In the revised version we will add this control experiment: a skip-connection DAE trained solely with the reconstruction loss (no correlation-distance penalties), with results reported alongside the linear and non-linear CDSK-DAE variants under the same seen/unseen noise conditions. This will permit clearer attribution of performance differences.

revision: yes
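One way to picture the promised control is a single loss function in which the penalty weights can be switched off, so the ablation and the full model differ only in those weights. The sketch below assumes a mean-squared reconstruction loss and penalties of the form `1 - dependency`; the penalty form, the weight names, and the caller-supplied `dependency` function are illustrative assumptions, since the summary does not give the paper's objective.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Batch = Sequence[Vector]
Dependency = Callable[[Batch, Batch], float]  # returns a value in [0, 1]


def mean_squared_error(predicted: Batch, target: Batch) -> float:
    """Mean squared error over every element of a batch of feature vectors."""
    total, count = 0.0, 0
    for pred_vec, target_vec in zip(predicted, target):
        for p, t in zip(pred_vec, target_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def penalized_loss(enhanced: Batch, clean: Batch,
                   latent: Batch, latent_target: Batch,
                   dependency: Dependency,
                   weight_latent: float = 0.0,
                   weight_output: float = 0.0) -> float:
    """Reconstruction loss plus optional dependency penalties.

    With both weights at 0.0 this reduces to the reconstruction-only
    control described in the rebuttal.
    """
    loss = mean_squared_error(enhanced, clean)
    if weight_latent:
        # Penalize weak dependency between latent target and model latent.
        loss += weight_latent * (1.0 - dependency(latent_target, latent))
    if weight_output:
        # Penalize weak dependency between latent target and enhanced output.
        loss += weight_output * (1.0 - dependency(latent_target, enhanced))
    return loss
```

Keeping the architecture and data fixed while changing only `weight_latent` and `weight_output` would isolate the contribution of the penalty terms from that of the skip connections.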

Circularity Check

0 steps flagged

No circularity; empirical proposal of new architecture and loss

full rationale

The paper defines a CDSK-DAE with skip connections and a correlation-distance penalty term in the training objective, then reports empirical WER improvements versus baselines under seen/unseen noise. No derivation chain is presented that reduces a claimed result to its own fitted parameters or self-citations by construction. The central performance claim rests on direct experimental comparison rather than any self-definitional, fitted-input-renamed-as-prediction, or self-citation-load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no explicit free parameters, mathematical axioms, or new invented entities; relies on standard concepts in machine learning for speech processing.

pith-pipeline@v0.9.0 · 5749 in / 1126 out tokens · 36837 ms · 2026-05-24T15:36:06.517407+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1] Goodman JT. A bit of progress in language modeling. Computer Speech and Language. 2001 Aug;15(4):403–434.

[2] Dahl GE, Yu D, Deng L, Acero A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing. 2012;20(1):30–42.

[3] Mohamed AR, Dahl GE, Hinton G. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech and Language Processing. 2012;20(1):14–22.

[4] Mohamed AR, Dahl G, Hinton G. Deep belief networks for phone recognition. In: NIPS Workshop Deep Learning for Speech Recognition and Related Applications. vol. 1; 2009. p. 39.

[5] Rao K, Peng F, Sak H, Beaufays F. Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2015. p. 4225–4229.

[6] Mikolov T, Kombrink S, Burget L, Černocký J, Khudanpur S. Extensions of recurrent neural network language model. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2011. p. 5528–5531.

[7] Xu H, Chen T, Gao D, Wang Y, Li K, Goel N, et al. A pruned rnnlm lattice-rescoring algorithm for automatic speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2018. p. 5929–5933.

[8] Sundermeyer M, Schlueter R, Ney H. LSTM neural networks for language modeling. In: INTERSPEECH.

[9] Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. 2014.

[10] Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks. International Conference on Machine Learning. 2014;p. 1764–1772.

[11] Zhang Y, Chan W, Jaitly N. Very deep convolutional networks for end-to-end speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2017. p. 4845–4849.

[12] Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. arXiv preprint arXiv:1507.08240. 2016;p. 167–174.

[13] Chan W, Jaitly N, Le Q, Vinyals O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2016. p. 4960–4964.

[14] Bahdanau D, Serdyuk D, Brakel P, Ke NR, Chorowski J, Courville A, et al. Task loss estimation for sequence prediction. arXiv preprint arXiv:1511.06456. 2015.

[15] Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: 23rd International Conference on Machine Learning (ICML); 2006. p. 369–376.

[16] Lu X, Tsao Y, Matsuda S, Hori C. Speech enhancement based on deep denoising autoencoder. In: INTERSPEECH; 2013. p. 436–440.

[17] Feng X, Zhang Y, Glass J. Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2014. p. 1759–1763.

[18] Hsu WN, Zhang Y, Glass J. Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. In: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU); 2017. p. 16–23.

[19] Zhang H, Liu C, Inoue N, Shinoda K. Multi-Task Autoencoder for Noise-Robust Speech Recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2018. p. 5599–5603.

[20] Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependency by correlation of distances. Annals of Statistics. 2007;35(6):2769–2794.

[21] Jyoti Bora D, Kumar Gupta A. Effect of different distance measures on the performance of K-Means Algorithm: An experimental study in Matlab. International Journal of Computer Science and Information Technologies (IJCSIT). 2014;5(2):2501–2506.

[22] Vincent P, Larochelle H, Bengio Y, Manzagol PA. Extracting and composing robust features with denoising autoencoders. In: 25th International Conference on Machine Learning; 2008. p. 1096–1103.

[23] Lu X, Matsuda S, Hori C, Kashioka H. Speech restoration based on deep learning autoencoder with layer-wised pretraining. In: INTERSPEECH; 2012. p. 1504–1507.

[24] Srivastava RK, Greff K, Schmidhuber J. Training Very Deep Networks. In: Advances in Neural Information Processing Systems (NIPS); 2015. p. 2377–2385.

[25] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–778.

[26] Mao XJ, Shen C, Yang YB. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: Advances in Neural Information Processing Systems (NIPS); 2016. p. 2802–2810.

[27] Liu JY, Yang YH. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. In: 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE; 2018. p. 773–778.

[28] Tu M, Zhang X. Speech enhancement based on deep neural networks with skip connections. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2017. p. 5565–5569.

[29] He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: European Conference on Computer Vision (ECCV); 2016. p. 630–645.

[30] Veit A, Wilber M, Belongie S. Residual networks behave like ensembles of relatively shallow networks. In: Advances in Neural Information Processing Systems (NIPS); 2016. p. 550–558.

[31] European Telecommunications Standards Institute. ETSI: EG 202 396-1 v1.2.2; 2008.

[32] ITU-T. ITU-T P.56 Objective measurement of active speech level; 2011.

[33] Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. Caffe: Convolutional architecture for fast feature embedding. In: ACM Multimedia.

[34] p. 675–678.
