Large-Scale Mixed-Bandwidth Deep Neural Network Acoustic Modeling for Automatic Speech Recognition

Khoi-Nguyen C. Mac; Michael Picheny; Wei Zhang; Xiaodong Cui

arXiv: 1907.04887 · v1 · pith:NR74MYQUnew · submitted 2019-07-10 · 📡 eess.AS · cs.CL · cs.SD

Pith reviewed 2026-05-24 23:04 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.SD

keywords mixed-bandwidth acoustic modeling, deep neural networks, automatic speech recognition, wideband speech, narrowband speech, bandwidth extension, distributed training

The pith

A single deep neural network acoustic model trained on mixed wideband and narrowband data performs comparably to separate models across diverse test sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether mixed-bandwidth deep neural network acoustic modeling is practical at large scale for automatic speech recognition. It combines 1,150 hours of wideband data with 2,300 hours of narrowband data and tests alignment strategies such as downsampling, upsampling, and bandwidth extension. Performance is measured on eight test sets drawn from varied application domains. Distributed synchronous training on multiple GPUs handles the data volume. If the strategies succeed, one model can serve inputs of either bandwidth, reducing the need for separate systems in deployment.

Core claim

Mixed-bandwidth deep neural network acoustic modeling is practical and effective for ASR. Training jointly on 1,150 hours of wideband data and 2,300 hours of narrowband data with downsampling, upsampling, and bandwidth extension strategies produces models that perform effectively on eight diverse wideband and narrowband test sets from multiple domains; synchronous data-parallel training across GPUs makes the scale feasible.

What carries the argument

The bandwidth adaptation strategies (downsampling, upsampling, and bandwidth extension) that align acoustic features from wideband and narrowband signals for joint DNN training.
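To make the first two strategies concrete, here is a minimal sketch of factor-of-two resampling between 16 kHz and 8 kHz on raw waveforms. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the 31-tap Hamming-windowed sinc filter, and the choice to operate on waveforms rather than features are all hypothetical. Bandwidth extension, the third strategy, is a learned mapping and is not covered here.

```python
import math

def lowpass_taps(num_taps=31, cutoff=0.25):
    """Hamming-windowed sinc low-pass FIR; cutoff is a fraction of the sampling rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC

def fir_filter(signal, taps):
    """Same-length convolution, treating samples outside the signal as zero."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out

def downsample_by_2(wb):
    """16 kHz -> 8 kHz: low-pass at 4 kHz to limit aliasing, then keep every other sample."""
    return fir_filter(wb, lowpass_taps())[::2]

def upsample_by_2(nb):
    """8 kHz -> 16 kHz: insert zeros, then low-pass to interpolate.

    The band above 4 kHz stays empty; upsampling creates no new high-frequency content.
    """
    stuffed = []
    for s in nb:
        stuffed.extend([s, 0.0])
    taps = [2 * t for t in lowpass_taps()]  # gain of 2 compensates for the inserted zeros
    return fir_filter(stuffed, taps)
```

The asymmetry is the point of the sketch: downsampling discards the upper band of the wideband data, while upsampling leaves that band empty in the narrowband data, which is the gap a bandwidth extension model tries to fill.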

If this is right

A single acoustic model suffices for both wideband and narrowband inputs without major performance penalties.
ASR deployment can avoid maintaining and switching between separate bandwidth-specific models.
Over 3,000 hours of mixed data can be leveraged for training through distributed GPU parallelism.
The same adaptation approach applies across test domains from telephony to other applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Storage and maintenance costs for ASR systems could decrease by replacing multiple models with one.
The method might extend naturally to other input variations such as differing noise levels or channel conditions.
Real-time systems could see simpler pipelines when handling mixed-quality audio streams without explicit bandwidth detection.

Load-bearing premise

The bandwidth adaptation strategies can align acoustic features from wideband and narrowband data well enough to support effective joint training.

What would settle it

If the mixed-bandwidth model produces substantially higher word error rates than dedicated wideband or narrowband models on the eight test sets, the claim of practical effectiveness would not hold.
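The comparison described above rests on word error rate. The following is a small sketch of how WER is commonly computed, as word-level edit distance divided by the number of reference words. The function names and the toy sentences are illustrative assumptions; none of the paper's transcripts or scores are reproduced here.

```python
def word_errors(reference, hypothesis):
    """Return (minimum word edits, reference length) via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # cost of turning an empty reference into hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)

def wer(pairs):
    """Corpus-level WER in percent over (reference, hypothesis) pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return 100.0 * errors / words

# Toy usage with made-up transcripts: one substitution and one deletion in six words.
toy_pairs = [("the cat sat on the mat", "the cat sit on mat")]
toy_wer = wer(toy_pairs)
```

Running the same function over each of the eight test sets for the mixed-bandwidth model and for the dedicated baselines would give the per-set gaps that decide the claim.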

Figures

Figures reproduced from arXiv: 1907.04887 by Khoi-Nguyen C. Mac, Michael Picheny, Wei Zhang, Xiaodong Cui.

**Figure 1.** Illustration of the training of the BWE mapping. The mapping is realized as a CNN with a VGG architecture. Its output is connected to the WB CNN acoustic model after tensor manipulation. The WB CNN is fixed and the BWE CNN is optimized under the CE criterion.

4.3. Distributed Training. The networks are optimized under the CE criterion using the SGD algorithm. Learning rate starts as 0.01 for 10 epochs and i…

Original abstract

In automatic speech recognition (ASR), wideband (WB) and narrowband (NB) speech signals with different sampling rates typically use separate acoustic models. Therefore mixed-bandwidth (MB) acoustic modeling has important practical values for ASR system deployment. In this paper, we extensively investigate large-scale MB deep neural network acoustic modeling for ASR using 1,150 hours of WB data and 2,300 hours of NB data. We study various MB strategies including downsampling, upsampling and bandwidth extension for MB acoustic modeling and evaluate their performance on 8 diverse WB and NB test sets from various application domains. To deal with the large amounts of training data, distributed training is carried out on multiple GPUs using synchronous data parallelism.
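The abstract's last sentence names synchronous data parallelism. A toy single-process simulation of the idea is sketched below: each worker computes a gradient on its own shard, the gradients are averaged (the role an all-reduce plays on real hardware), and every replica applies the same update. The one-parameter linear model, the function names, and the hyperparameters are assumptions chosen for illustration, not the paper's networks or settings.

```python
def shard_gradient(w, shard):
    """Gradient of mean squared error for y ≈ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up holding the same average."""
    return sum(values) / len(values)

def sync_sgd_step(w, shards, lr):
    """One synchronous step: all workers finish before the shared update is applied."""
    grads = [shard_gradient(w, s) for s in shards]  # would run in parallel on GPUs
    return w - lr * all_reduce_mean(grads)

def train(data, num_workers=4, lr=0.01, steps=200):
    # Equal-sized shards, so the averaged shard gradients equal the full-batch gradient.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        w = sync_sgd_step(w, shards, lr)
    return w

# Toy data generated from y = 3x; training should recover w close to 3.
toy_data = [(float(x), 3.0 * x) for x in range(1, 9)]
learned_w = train(toy_data)
```

With equal shard sizes the synchronous update matches what a single worker would compute on the combined batch, which is why this scheme scales the batch across GPUs without changing the optimization target.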

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a solid large-scale empirical check that standard bandwidth adaptation tricks let you train one DNN acoustic model on mixed WB and NB data without big surprises.

The paper trains a single DNN acoustic model on 1,150 hours of wideband plus 2,300 hours of narrowband data and tests several ways to handle the mismatch: downsampling the WB data, upsampling the NB data, or using bandwidth extension. The authors run the experiments on eight different test sets and use synchronous multi-GPU training to make the scale feasible. That combination of data volume and breadth of test conditions is the main concrete contribution; it moves the mixed-bandwidth question from small-scale demos to something closer to production data sizes. The work is straightforward and addresses a real deployment headache where separate WB and NB models are inconvenient. The distributed training description is routine but necessary to report at this size.

The soft spot is that the abstract gives no numbers, so this letter has to assume the full paper shows the usual small degradation on the narrower-band tests and that the authors compare against single-bandwidth baselines. If those gaps are modest and the paper includes the actual WER tables, the claim holds; if the losses turn out larger than expected, the practical value shrinks. No circularity or invented assumptions appear in the design.

This paper is for ASR groups that already run large DNN systems and need to merge data sources. It is not paradigm-shifting, but the scale and the explicit multi-test-set evaluation make it worth a referee's time rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper investigates large-scale mixed-bandwidth (MB) deep neural network acoustic modeling for automatic speech recognition (ASR). It trains models on 1,150 hours of wideband (WB) and 2,300 hours of narrowband (NB) data, comparing strategies including downsampling, upsampling, and bandwidth extension. Models are evaluated on eight diverse WB and NB test sets from various domains, with distributed synchronous training on multiple GPUs to handle the data volume. The central claim is that MB modeling is practical and effective for ASR deployment.

Significance. If the empirical results hold, the work has clear practical significance for ASR systems that must handle mixed sampling rates without maintaining separate models. The scale of the training data (over 3,000 hours total) and breadth of evaluation across eight test sets provide a strong test of the strategies' robustness. The use of distributed GPU training is a standard but necessary detail for reproducibility at this scale.

minor comments (3)

[§3] §3 (or equivalent methods section): the description of the bandwidth extension network architecture and training objective should include the precise loss function and any hyperparameter values used, as these directly affect reproducibility of the MB results.
[Table 2] Table 2 (or results table): the WER numbers for the separate WB and NB baselines should be reported alongside the MB models for all eight test sets to allow direct quantification of any degradation from joint training.
[§2] The abstract states the data volumes but the main text should explicitly state the total number of parameters in the DNN acoustic models and the frame rate or feature dimension used, to clarify the scale of the experiments.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, including its practical significance for ASR deployment and the robustness of the large-scale evaluation across eight test sets. The recommendation is for minor revision, but the report contains no specific major comments requiring response.

Circularity Check

0 steps flagged

Empirical study with no derivation chain

This paper is an empirical investigation of mixed-bandwidth DNN acoustic modeling. It trains models on 1,150 hours WB + 2,300 hours NB data, applies standard adaptation strategies (downsampling, upsampling, bandwidth extension), performs distributed GPU training, and reports WER on eight test sets. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. All claims rest on experimental outcomes rather than any internal reduction to inputs by construction, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are mentioned; the paper is an empirical investigation of existing neural network training techniques applied to mixed-bandwidth data.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

[1] Introduction. Wideband (WB) and narrowband (NB) speech signals are two types of input signals that widely exist in speech-related applications. In automatic speech recognition (ASR), acoustic models are usually separately trained for WB and NB speech data given their distinct spectral characteristics under different sampling rates. From the system deploy...

[2] Related Work. MB acoustic modeling for ASR has been investigated previously under various conditions [2, 3, 4, 5]. In [2], the NB data is used to leverage the training of WB acoustic models in a GMM-HMM framework and the missing components in upsampled NB speech features are dealt with using the expectation-maximization (EM) algorithm [6] for the MB GMM-HM...

[3] Bandwidth Extension. BWE has been an active research topic in communication and acoustics processing. NB speech signals, such as telephony speech signals, suffer from degraded quality and intelligibility due to the lack of high frequency spectral information eliminated by the low-pass band limitation of communication channels. Over the years, extensive res...

[4] System Implementation, 4.1. Feature Space. The WB speech is sampled at 16 kHz while the NB speech at 8 kHz. The input feature space consists of 40-dimensional logmel features after application of first a global cepstral mean normalization (CMN) and then an utterance-based CMN. There are three input feature maps to the CNNs, the static logmel features and their...

[5] Experimental Results. There are 1,150 hours of WB training data which consists of 420 hours of Broadcast News data, 450 hours of internal dictation data, 100 hours of meeting data, 140 hours of hospitality (travel and hotel reservation) data and 40 hours of accented data. There are 2,300 hours of NB training data which consists of 2,000 hours of Switchboard data and 300 hours of IBM call center data...

[6] The decoding vocabulary comprises of 250K words and the language model (LM) is a 4-gram LM with modified Kneser-Ney Smoothing consisting of 200M n-grams. The LM training data is selected from a broad variety of sources. Stronger language modeling techniques such as RNN LMs will be investigated in the future. The word error rates (WERs) of various models on...

[7] Discussion and Summary. As can be observed from the breakdown performance in Table 2, consistent improvements of one technique across all test sets are rarely observed. The conclusion drawn from one particular test set by one technique may not generalize to other test sets, although the average WERs can give us a good idea on the overall performance of c...

[8] ... and its applications to BWE [17], we hope to further improve BWE in the context of MB modeling for consistent superior performance

[9] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.

[10] M. L. Seltzer and A. Acero, “Training wideband acoustic models using mixed-bandwidth training data for speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 235–245, 2007.

[11] J. Li, D. Yu, J.-T. Huang, and Y. Gong, “Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM,” in IEEE Workshop on Spoken Language Technology (SLT), 2012, pp. 131–136.

[12] J. Gao, J. Du, C. Kong, H. Lu, E. Chen, and C.-H. Lee, “An experimental study on joint modeling of mixed-bandwidth data via deep neural networks for robust speech recognition,” in International Joint Conference on Neural Networks (IJCNN), 2016, pp. 588–594.

[13] J. Gao, J. Du, and E. Chen, “Mixed-bandwidth cross-channel speech recognition via joint optimization of DNN-based bandwidth expansion and acoustic modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 559–571, March 2019.

[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[15] B. Iser, W. Minker, and G. Schmidt, Bandwidth extension of speech signals. Springer, 2008.

[16] N. Prasad and T. Kumar, “Bandwidth extension of speech signals: a comprehensive review,” International Journal of Intelligent Systems Technologies and Applications, vol. 2, no. 2, pp. 45–52, 2016.

[17] F. Nagel and S. Disch, “A harmonic bandwidth extension method for audio codecs,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 145–148.

[18] H. Pulakka and P. Alku, “Bandwidth extension of telephone speech using a neural network and a filter bank implementation for high-band Mel spectrum,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2170–2183, 2011.

[19] C. Liu, Q.-J. Fu, and S. Narayanan, “Effect of bandwidth extension to telephone speech recognition in cochlear implant users,” The Journal of the Acoustical Society of America, vol. 26, no. 5, pp. 77–83, 2018.

[20] Z.-H. Ling, Y. Ai, Y. Gu, and L.-R. Dai, “Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 125, no. 2, pp. 883–894, 2009.

[21] K. Li, Z. Huang, Y. Xu, and C.-H. Lee, “DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech,” in Interspeech, 2015, pp. 2578–2582.

[22] W. Zhang, M. Feng, Y. Zheng, Y. Ren, Y. Wang, J. Liu, P. Liu, B. Xiang, L. Zhang, B. Zhou, and F. Wang, “Gadei: On scale-up training as a service for deep learning,” in The IEEE International Conference on Data Mining (ICDM), 2017.

[23] “Nvidia collective communications library (NCCL),” https://developer.nvidia.com/nccl, accessed: 2018-10-26.

[24] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems (NIPS), 2014.

[25] D. Haws and X. Cui, “CycleGAN bandwidth extension acoustic modeling for automatic speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6780–6784.
