
arxiv: 1907.02526 · v1 · pith:E7HWBJZNnew · submitted 2019-07-03 · 💻 cs.SD · cs.LG · eess.AS

Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients

Pith reviewed 2026-05-25 09:12 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS
keywords speech enhancement · cochlear implant · convolutional neural network · Wiener filter · envelope coefficient measure · causal network · filter-bank features

The pith

Convolutional neural networks in cochlear filter-bank features improve speech enhancement for cochlear implant users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes performing speech enhancement directly in a cochlear filter-bank feature space tailored to CI auditory stimuli, using convolutional neural networks to separate stationary and non-stationary noise from speech. Three architectures are introduced: a vanilla CNN that directly outputs the enhanced signal, an SS-CNN that predicts and subtracts noise, and a Wiener-CNN that estimates an optimal suppression mask; causal versions of each are also developed to enable real-time use. Experiments on these networks show significant gains over baseline systems, with the causal Wiener-CNN producing the highest envelope coefficient measure scores. This positions the method as a practical preprocessor option for CI devices in noisy settings.
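
The difference between the three architectures lies in how the network output is turned into an enhanced frame. The following is a minimal sketch of that difference, not the authors' implementation: `Frame`, `Net`, and the three function names are hypothetical, and the zero floor and the [0, 1] clipping are assumptions added so that the outputs stay valid envelopes.

```python
from typing import Callable, List

Frame = List[float]            # one time frame of filter-bank channel envelopes
Net = Callable[[Frame], Frame]  # stand-in for a trained network (hypothetical)


def enhance_vanilla(noisy: Frame, net: Net) -> Frame:
    # Vanilla style: the network output is used directly as the enhanced frame.
    return net(noisy)


def enhance_ss(noisy: Frame, net: Net) -> Frame:
    # Spectral-subtraction style: the network predicts the noise, which is
    # subtracted from the noisy input. Flooring at zero is an assumption.
    noise = net(noisy)
    return [max(x - n, 0.0) for x, n in zip(noisy, noise)]


def enhance_wiener(noisy: Frame, net: Net) -> Frame:
    # Wiener style: the network predicts a per-channel mask that scales the
    # noisy input. Clipping the mask to [0, 1] is an assumption.
    mask = net(noisy)
    return [x * min(max(m, 0.0), 1.0) for x, m in zip(noisy, mask)]


noisy_frame = [1.0, 0.5, 0.25]
ss_out = enhance_ss(noisy_frame, lambda f: [0.25, 0.25, 0.25])
wiener_out = enhance_wiener(noisy_frame, lambda f: [0.5, 1.5, -1.0])
```

With these stand-in networks, `ss_out` is `[0.75, 0.25, 0.0]` and `wiener_out` is `[0.5, 0.5, 0.0]`; in practice `net` would be the trained CNN.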

Core claim

By operating convolutional neural networks in a cochlear filter-bank feature space, the proposed vanilla, spectral-subtraction-style, and Wiener-style networks (both causal and non-causal) achieve significant improvement over existing baseline systems for speech enhancement in cochlear implant recipients, with the causal Wiener-CNN delivering the best overall envelope coefficient measure.
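
The causal/non-causal distinction comes down to which input samples each output sample is allowed to see. The sketch below illustrates this with a single 1-D convolution over a list of samples; it is an illustration under assumed names (`conv1d_centered`, `conv1d_causal`) and zero padding, not the layer configuration used in the paper.

```python
from typing import List


def conv1d_centered(x: List[float], kernel: List[float]) -> List[float]:
    """Non-causal: output t sees samples t-h .. t+h, so h future samples are needed."""
    k = len(kernel)
    h = k // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(k):
            i = t + j - h              # window centered on t
            if 0 <= i < len(x):        # zero padding outside the signal
                acc += kernel[j] * x[i]
        out.append(acc)
    return out


def conv1d_causal(x: List[float], kernel: List[float]) -> List[float]:
    """Causal: output t sees samples t-(k-1) .. t only, never the future."""
    k = len(kernel)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(k):
            i = t - (k - 1) + j        # window ends at t
            if i >= 0:                 # zero padding before the signal starts
                acc += kernel[j] * x[i]
        out.append(acc)
    return out


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
centered = conv1d_centered(impulse, [1.0, 1.0, 1.0])
causal = conv1d_causal(impulse, [1.0, 1.0, 1.0])
```

For an impulse at index 2, the centered version already responds at index 1 (it has looked one sample ahead), while the causal version responds only from index 2 onward. That look-ahead is the source of the delay the causal variants are meant to avoid.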

What carries the argument

Wiener-style CNN that generates an optimal mask for noise suppression within the cochlear filter-bank feature space.
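
For orientation, the classical Wiener gain for one channel is the ratio of speech power to speech-plus-noise power. The sketch below computes that textbook quantity; whether the Wiener-CNN is trained toward exactly this target is not stated on this page, so treat the link as an assumption. The function name and the zero-power convention are hypothetical.

```python
def wiener_gain(speech_power: float, noise_power: float) -> float:
    """Textbook Wiener gain S / (S + N) for a single channel.

    Returns 0.0 when both powers are zero (an assumed convention, chosen
    so that silent channels stay silent).
    """
    total = speech_power + noise_power
    if total == 0.0:
        return 0.0
    return speech_power / total


# Equal speech and noise power (0 dB SNR) gives a gain of one half.
gain_0db = wiener_gain(1.0, 1.0)
# Speech three times stronger than noise gives a gain of three quarters.
gain_high = wiener_gain(3.0, 1.0)
```

The gain approaches 1 in channels dominated by speech and 0 in channels dominated by noise, which is the behavior a suppression mask needs.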

If this is right

  • The proposed networks achieve significant improvement over existing baseline systems.
  • Causal Wiener-CNN outperforms other networks.
  • Causal Wiener-CNN leads to the best overall envelope coefficient measure.
  • The algorithms represent a viable option for implementation on the CCi-MOBILE research platform as a pre-processor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Direct comparison of ECM scores against actual word recognition rates from CI users would test whether the reported metric gains reliably predict perceptual benefit.
  • Extending the same filter-bank CNN approach to other auditory prostheses or different noise environments could reveal broader applicability.
  • Embedding the causal Wiener-CNN into existing CI signal-processing pipelines would allow direct measurement of end-to-end latency and power impact.

Load-bearing premise

Gains measured on the envelope coefficient measure in the reported test conditions will correspond to improved speech intelligibility for cochlear implant users under varied real-world noise conditions.
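
To make the premise concrete, an envelope-based score compares clean and processed channel envelopes. The sketch below averages a per-channel Pearson correlation; it is a generic stand-in to show the kind of quantity involved, not the ECM definition used in the paper, and `pearson` and `envelope_correlation_score` are hypothetical names.

```python
import math
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def envelope_correlation_score(clean: List[List[float]],
                               processed: List[List[float]]) -> float:
    """Mean per-channel correlation; each argument is a list of channel envelopes."""
    scores = [pearson(c, p) for c, p in zip(clean, processed)]
    return sum(scores) / len(scores)


clean_env = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
scaled_env = [[0.0, 2.0, 4.0], [4.0, 2.0, 0.0]]   # same shape, different level
score = envelope_correlation_score(clean_env, scaled_env)
```

A score of this kind is insensitive to overall level and says nothing by itself about what a listener perceives, which is why the premise above needs listening data to back it.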

What would settle it

A listening test with actual cochlear implant users in naturalistic noisy environments that finds no intelligibility improvement despite higher envelope coefficient measure scores.

Figures

Figures reproduced from arXiv: 1907.02526 by John H.L. Hansen, Nursadul Mamun, Soheil Khorram.

Figure 1
Figure 1: Cochlear implant electrode stimulation response shown as an electrodogram. … portant auditory features employed with a CIS-Continuous Interleaved Sampling strategy. (6) Finally, biphasic pulses are generated from the selected features and sent to the UTDallas CCi-MOBILE research interface board through electrical stimulations [12]. These electrical stimulations can be visualized using electrodograms. An el…
Figure 2
Figure 2: (a) Block diagram of the standard CNN used in this paper. (b) Block diagram of the causal convolutional network (Causal CNN) that leverages causal convolutional kernels in each layer. The causal kernels consider only previous samples of the signals. (c) Various SE systems proposed in this paper; we incorporate both CNN and causal CNN in three different network architectures: Vanilla, spectral-subtraction-st…
Figure 4
Figure 3
Figure 3: Mean speech intelligibility score based on the ECM measure as a function of SNR for the proposed non-causal SE algorithms. Noise environments: (a) Car 1: Mitsubishi Galant (2002) (b) Car 2: Nissan Sentra (2008). … development, and test sets. Train set includes 3150 utterances and is used to train our CNNs. Development set contains 1575 utterances and is used to tune the network hyper-parameters. Test set contain…
original abstract

Attempts to develop speech enhancement algorithms with improved speech intelligibility for cochlear implant (CI) users have met with limited success. To improve speech enhancement methods for CI users, we propose to perform speech enhancement in a cochlear filter-bank feature space, a feature-set specifically designed for CI users based on CI auditory stimuli. We leverage a convolutional neural network (CNN) to extract both stationary and non-stationary components of environmental acoustics and speech. We propose three CNN architectures: (1) vanilla CNN that directly generates the enhanced signal; (2) spectral-subtraction-style CNN (SS-CNN) that first predicts noise and then generates the enhanced signal by subtracting noise from the noisy signal; (3) Wiener-style CNN (Wiener-CNN) that generates an optimal mask for suppressing noise. An important problem of the proposed networks is that they introduce considerable delays, which limits their real-time application for CI users. To address this, this study also considers causal variations of these networks. Our experiments show that the proposed networks (both causal and non-causal forms) achieve significant improvement over existing baseline systems. We also found that causal Wiener-CNN outperforms other networks, and leads to the best overall envelope coefficient measure (ECM). The proposed algorithms represent a viable option for implementation on the CCi-MOBILE research platform as a pre-processor for CI users in naturalistic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes performing speech enhancement directly in a cochlear filter-bank feature space using three CNN architectures (vanilla CNN, spectral-subtraction-style CNN, and Wiener-style CNN) and their causal variants. The central empirical claim is that these networks, particularly the causal Wiener-CNN, achieve significant improvement over existing baselines on the envelope coefficient measure (ECM) and represent a viable real-time option for the CCi-MOBILE platform.

Significance. The work targets a clinically relevant application by aligning the enhancement domain with CI auditory stimuli and explicitly addressing latency via causal networks. If the ECM gains are robust and reproducible, the approach could support practical deployment; the causal variants are a clear practical contribution.

major comments (2)
  1. [Abstract and Experimental Results] Abstract and Experimental Results section: the claim that the networks 'achieve significant improvement over existing baseline systems' is presented without dataset sizes, number of noise conditions or subjects, baseline system descriptions, statistical tests, or cross-validation details. These omissions make the central empirical result impossible to evaluate from the reported text.
  2. [Abstract] Abstract: the stated goal is improved speech intelligibility for CI users in naturalistic environments, yet only ECM is reported; no listening tests, word-recognition scores, or correlation analysis between ECM and intelligibility under the cited real-world conditions are provided. This leaves the link between the measured gains and the clinical motivation untested.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'leads to the best overall envelope coefficient measure (ECM)' is ambiguous; clarify whether this means the highest ECM score or another aggregate.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed comments. We address each major point below, providing clarifications from the manuscript and indicating where revisions will be made to improve evaluability.

point-by-point responses
  1. Referee: [Abstract and Experimental Results] Abstract and Experimental Results section: the claim that the networks 'achieve significant improvement over existing baseline systems' is presented without dataset sizes, number of noise conditions or subjects, baseline system descriptions, statistical tests, or cross-validation details. These omissions make the central empirical result impossible to evaluate from the reported text.

    Authors: The manuscript's Experimental Setup section specifies the dataset (TIMIT utterances mixed with UT-Drive in-vehicle noise at multiple SNRs, with train/development/test splits), number of noise conditions, baseline systems (e.g., spectral subtraction and Wiener filtering variants), and statistical tests (paired t-tests on ECM scores across conditions). The abstract and results summary are intentionally concise. To address the concern, we will expand the abstract with key parameters (dataset size, noise conditions, and a note on statistical testing) while keeping it within length limits. revision: yes

  2. Referee: [Abstract] Abstract: the stated goal is improved speech intelligibility for CI users in naturalistic environments, yet only ECM is reported; no listening tests, word-recognition scores, or correlation analysis between ECM and intelligibility under the cited real-world conditions are provided. This leaves the link between the measured gains and the clinical motivation untested.

    Authors: ECM was selected as the primary metric because it directly quantifies improvements in the envelope coefficients that form the input to CI processors, aligning with the paper's focus on feature-space enhancement for the CCi-MOBILE platform. The manuscript cites prior work linking envelope measures to intelligibility but does not include new listening tests or word-recognition data, as these require CI user recruitment and were outside the scope of this objective evaluation study. We will add a brief discussion paragraph citing established correlations between ECM and intelligibility from the CI literature. revision: partial

standing simulated objections not resolved
  • Absence of new subjective listening tests or word-recognition scores to directly validate ECM gains against clinical intelligibility outcomes in naturalistic conditions.

Circularity Check

0 steps flagged

No circularity: experimental results rest on independent test-set comparisons

full rationale

The paper proposes three CNN architectures (vanilla, SS-CNN, Wiener-CNN) and their causal variants for speech enhancement in cochlear filter-bank space, then reports ECM improvements on held-out test data versus baselines. No equations, derivations, or parameter-fitting steps are described that would make any reported 'prediction' equivalent to an input by construction. No self-citation chains or uniqueness theorems are invoked to justify the architectures or metrics. The central claim is therefore an empirical comparison whose validity can be checked against external benchmarks without reducing to the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that CNNs can learn useful mappings from the chosen filter-bank features to cleaner envelopes and that ECM is a sufficient proxy for clinical benefit; no new physical entities or ad-hoc constants are introduced beyond standard neural-network training.

axioms (1)
  • domain assumption Convolutional networks can extract both stationary and non-stationary acoustic components when trained on the cochlear filter-bank representation.
    Invoked in the description of the three proposed architectures.

pith-pipeline@v0.9.0 · 5782 in / 1227 out tokens · 21576 ms · 2026-05-25T09:12:11.376770+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 1 internal anchor

  1. [1]

    Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients

    Introduction A cochlear implant (CI) is an implantable electronic device that provides the necessary sensation for hearing [1, 2, 3]; CI partially restores hearing ability for subjects with sensorineural hearing loss (generally profound hearing loss). According to a report by the U.S. Food and Drug Administration, over 96000 people in US (324,000 people...

  2. [2]

    We then explain details of the proposed SE algorithms

    Methodology In this section, we first briefly introduce the CI pipeline. We then explain details of the proposed SE algorithms. We also describe the computation of the objective speech intelligibility score designed for the CI users. We finally discuss existing baseline SE systems as well as different components of the proposed algorithms. 2.1. Cochlear ...

  3. [3]

    UT-Drive

    Experiments In this section, we compare the performance of the proposed and the baseline SE algorithms. 3.1. Dataset We use “UT-Drive” corpora to perform the experiments in this study [34]. UT-Drive is a large-scale database of noise signals collected across different vehicle platforms under a wide range of field driving conditions. The database contains t...

  4. [4]

    The contribution of this study is threefold

    Conclusion The main goal of this study has been to propose a set of CNN-based SE algorithms that could be useful for CI users in naturalistic noisy conditions. The contribution of this study is threefold. First, we extracted speech features from noisy signal based on CI auditory features. The extracted features were used in the proposed SE algorithms...

  5. [5]

    R01 DC016839-02)

    Acknowledgement This work was primarily supported by a National Institute on Deafness and Other Communication Disorders (NIDCD) Grant (No. R01 DC016839-02)

  6. [6]

    Cochlear implant failures and reimplantation: A 30-year analysis and literature review,

    C. Lane, K. Zimmerman, S. Agrawal, and L. Parnes, “Cochlear implant failures and reimplantation: A 30-year analysis and literature review,” The Laryngoscope, 2019

  7. [7]

    Near physiological spectral selectivity of cochlear optogenetics,

    A. Dieter, C. J. Duque-Afonso, V. Rankovic, M. Jeschke, and T. Moser, “Near physiological spectral selectivity of cochlear optogenetics,” Nature communications, vol. 10, 2019

  8. [8]

    The cci-mobile vocoder,

    H. Ali, N. Mamun, A. Bruggeman, R. C. M. Chandra Shekar, J. N. Saba, and J. H. L. Hansen, “The cci-mobile vocoder,” The Journal of the Acoustical Society of America, vol. 144, no. 3, pp. 1872–1872, 2018

  9. [9]

    (2014) National institute on deafness and other communication disorders, cochlear implants

    NIDCD and NIH. (2014) National institute on deafness and other communication disorders, cochlear implants. [Online]. Available: http:////www.nidcd.nih.gov/health/hearing/pages/coch.aspx/

  10. [10]

    Cochlear implants: system design, integration, and evaluation,

    F.-G. Zeng, S. Rebscher, W. Harrison, X. Sun, and H. Feng, “Cochlear implants: system design, integration, and evaluation,” IEEE reviews in biomedical engineering, pp. 115–142, 2008

  11. [11]

    An auditory-masking-threshold-based noise suppression algorithm gmmse-amt [erb] for listeners with sensorineural hearing loss,

    A. Natarajan, J. H. L. Hansen, K. H. Arehart, and J. Rossi-Katz, “An auditory-masking-threshold-based noise suppression algorithm gmmse-amt [erb] for listeners with sensorineural hearing loss,” EURASIP Journal on Advances in Signal Processing, vol. 2005, no. 18, p. 678405, 2005

  12. [12]

    Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,

    L. M. Friesen, R. V. Shannon, D. Baskent, and X. Wang, “Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” The Journal of the Acoustical Society of America, vol. 110, no. 2, pp. 1150–1163, 2001

  13. [13]

    P. C. Loizou, Speech enhancement: theory and practice. CRC press, 2007

  14. [14]

    Speech enhancement for cochlear implant recipients,

    D. Wang and J. H. L. Hansen, “Speech enhancement for cochlear implant recipients,” The Journal of the Acoustical Society of America, vol. 143, no. 4, pp. 2244–2254, 2018

  15. [15]

    Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system,

    J. H. L. Hansen, V. Radhakrishnan, and K. H. Arehart, “Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2049–2063, 2006

  16. [16]

    Speech enhancement - an overview and recent advances,

    A. Dieter, C. J. Duque-Afonso, V. Rankovic, M. Jeschke, and T. Moser, “Speech enhancement - an overview and recent advances,” Encyclopedia of Electrical and Electronics Engineering, vol. 20, pp. 159–175, 1999

  17. [17]

    Cci-mobile: Design and evaluation of a cochlear implant and hearing aid research platform for speech scientists and engineers

    J. H. L. Hansen, H. Ali, J. Saba, R. C. shekhar, N. Mamun, R. Ghosh, and A. Brueggeman, “Cci-mobile: Design and evaluation of a cochlear implant and hearing aid research platform for speech scientists and engineers.” IEEE EMBS Inter Conf. Biomedical and health informatics (BHI-19), Chicago, IL, May 19-22, 2019

  18. [18]

    Quantifying cochlear implant users’ ability for speaker identification using ci auditory stimuli

    N. Mamun, R. Ghose, and J. H. Hansen, “Quantifying cochlear implant users’ ability for speaker identification using ci auditory stimuli.” in Interspeech, 2019

  19. [19]

    Suppression of acoustic noise in speech using spectral subtraction,

    S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on acoustics, speech, and signal processing, vol. 27, no. 2, pp. 113–120, 1979

  20. [20]

    An optimum mmse post-filter for adaptive noise cancellation in automobile environment,

    S. Khorram, H. Sameti, and H. Veisi, “An optimum mmse post-filter for adaptive noise cancellation in automobile environment,” in 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA). IEEE, 2012, pp. 431–435

  21. [21]

    A signal subspace approach for speech enhancement,

    Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Transactions on speech and audio processing, vol. 3, no. 4, pp. 251–266, 1995

  22. [22]

    Visually derived wiener filters for speech enhancement,

    I. Almajai and B. Milner, “Visually derived wiener filters for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1642–1651, 2011

  23. [23]

    Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users,

    T. Goehring, F. Bolner, J. J. Monaghan, B. van Dijk, A. Zarowski, and S. Bleeck, “Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users,” Hearing research, vol. 344, pp. 183–194, 2017

  24. [24]

    Multiple-target deep learning for lstm-rnn based speech enhancement,

    L. Sun, J. Du, L.-R. Dai, and C.-H. Lee, “Multiple-target deep learning for lstm-rnn based speech enhancement,” in 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). IEEE, 2017, pp. 136–140

  25. [25]

    A regression approach to speech enhancement based on deep neural networks,

    Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015

  26. [26]

    Snr-aware convolutional neural network modeling for speech enhancement

    S.-W. Fu, Y. Tsao, and X. Lu, “Snr-aware convolutional neural network modeling for speech enhancement.” in Interspeech, 2016, pp. 3768–3772

  27. [27]

    Jointly aligning and predicting continuous emotion annotations,

    S. Khorram, M. McInnis, and E. M. Provost, “Jointly aligning and predicting continuous emotion annotations,” IEEE Transactions on Affective Computing, 2019

  28. [28]

    Raw waveform-based speech enhancement by fully convolutional networks,

    S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, “Raw waveform-based speech enhancement by fully convolutional networks,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 006–012

  29. [29]

    Probabilistic permutation invariant training for speech separation,

    M. Yousefi, S. Khorram, and J. H. L. Hansen, “Probabilistic permutation invariant training for speech separation,” Proc. Interspeech, 2019

  30. [30]

    Compensation for domain mismatch in text-independent speaker recognition,

    F. Bahmaninezhad and J. H. L. Hansen, “Compensation for domain mismatch in text-independent speaker recognition,” Proc. Interspeech 2018, pp. 1071–1075, 2018

  31. [31]

    Capturing long-term temporal dependencies with convolutional networks for continuous emotion recognition,

    S. Khorram, Z. Aldeneh, D. Dimitriadis, M. McInnis, and E. M. Provost, “Capturing long-term temporal dependencies with convolutional networks for continuous emotion recognition,” Proc. Interspeech 2017, pp. 1253–1257, 2017

  32. [32]

    Prediction of speech intelligibility using a neurogram orthogonal polynomial measure (nopm),

    N. Mamun, W. A. Jassim, and M. S. Zilany, “Prediction of speech intelligibility using a neurogram orthogonal polynomial measure (nopm),” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 4, pp. 760–773, 2015

  33. [33]

    Predicting speech intelligibility with the regeneration of envelope from tfs cues for hearing impaired listeners,

    K. Akter and N. Mamun, “Predicting speech intelligibility with the regeneration of envelope from tfs cues for hearing impaired listeners,” in International Conference on Electrical, Computer and Communication Engineering (ECCE). IEEE, 2019, pp. 1–5

  34. [34]

    Measuring speech perception with recovered envelope cues using the peripheral auditory model,

    N. Mamun, K. Akter, H. Ali, and J. H. L. Hansen, “Measuring speech perception with recovered envelope cues using the peripheral auditory model,” The Journal of the Acoustical Society of America, vol. 144, no. 3, pp. 1872–1872, 2018

  35. [35]

    Predicting the speech reception threshold of cochlear implant listeners using an envelope-correlation based measure,

    N. Yousefian and P. C. Loizou, “Predicting the speech reception threshold of cochlear implant listeners using an envelope-correlation based measure,” The Journal of the Acoustical Society of America, vol. 132, no. 5, pp. 3399–3405, 2012

  36. [36]

    Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,

    Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE transactions on acoustics, speech, and signal processing, vol. 33, no. 2, pp. 443–445, 1985

  37. [37]

    Speech enhancement based on wavelet thresholding the multitaper spectrum,

    Y. Hu and P. C. Loizou, “Speech enhancement based on wavelet thresholding the multitaper spectrum,” IEEE transactions on Speech and Audio processing, vol. 12, no. 1, pp. 59–67, 2004

  38. [38]

    Speech enhancement based on a priori signal to noise estimation,

    P. Scalart et al., “Speech enhancement based on a priori signal to noise estimation,” in ICASSP, vol. 2. IEEE, 1996, pp. 629–632

  39. [39]

    In-vehicle speech and noise corpora,

    N. Krishnamurthy, R. Lubag, and J. H. L. Hansen, “In-vehicle speech and noise corpora,” in Digital Signal Processing for In-Vehicle Systems and Safety. Springer, 2012, pp. 145–157

  40. [40]

    Speech database development at mit: Timit and beyond,

    V. Zue, S. Seneff, and J. Glass, “Speech database development at mit: Timit and beyond,” Speech communication, vol. 9, no. 4, pp. 351–356, 1990

  41. [41]

    Progressive neural networks for transfer learning in emotion recognition,

    J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M. Provost, “Progressive neural networks for transfer learning in emotion recognition,” Interspeech 2017, pp. 1098–1102, 2017