A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition
Pith reviewed 2026-05-24 23:16 UTC · model grok-4.3
The pith
ADPSGD converges with a batch size three times larger than synchronous SGD for ASR training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can converge with a batch size 3X as large as the one used in Synchronous SGD on the SWB-300 and SWB-2000 ASR datasets. A Hierarchical-ADPSGD system lets learners on the same node form a super learner via fast allreduce while super learners communicate with ADPSGD. This reaches 7.6 percent WER on Hub5-2000 SWB and 13.2 percent WER on CH in 5.2 hours on 64 V100 GPUs connected by 100 Gb/s Ethernet.
What carries the argument
Hierarchical-ADPSGD (H-ADPSGD), which forms intra-node super learners with allreduce and runs ADPSGD among the super learners to support larger batches at scale.
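As a rough illustration of the two-level scheme described above, the following Python sketch simulates it on a toy scalar problem. It is a simplification under stated assumptions, not the paper's implementation: the names `h_adpsgd_step` and `simulate`, the quadratic objective, the noise level, and the ring topology are all hypothetical, and the asynchronous communication is replaced by one randomly chosen pairwise weight average per round.

```python
import random


def h_adpsgd_step(weights, grads_per_node, lr, rng):
    """One simulated round on scalar models (hypothetical helper).

    weights:        one weight per super learner (one per node)
    grads_per_node: for each node, the gradients from its local learners
    """
    n = len(weights)
    # Intra-node stage: allreduce-style averaging of the local gradients,
    # after which each super learner updates its own weight copy.
    updated = [w - lr * sum(g) / len(g) for w, g in zip(weights, grads_per_node)]
    # Inter-node stage: one randomly chosen super learner averages its
    # weights with its ring neighbour. This stands in for the asynchronous
    # decentralized averaging and is an assumption of this sketch.
    i = rng.randrange(n)
    j = (i + 1) % n
    avg = 0.5 * (updated[i] + updated[j])
    updated[i] = updated[j] = avg
    return updated


def simulate(num_nodes=4, learners_per_node=8, steps=200, lr=0.1, seed=0):
    """Minimise the toy objective 0.5 * (w - target)^2 with noisy gradients."""
    rng = random.Random(seed)
    target = 3.0
    weights = [rng.uniform(-1.0, 1.0) for _ in range(num_nodes)]
    for _ in range(steps):
        grads = [
            [(w - target) + rng.gauss(0.0, 0.1) for _ in range(learners_per_node)]
            for w in weights
        ]
        weights = h_adpsgd_step(weights, grads, lr, rng)
    return weights
```

Calling `simulate()` returns one weight per super learner; in this toy setting they all end up close to the target value 3.0.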
If this is right
- Larger batch sizes reduce the total number of communication rounds required during distributed training.
- The hierarchical design exploits fast local networks to keep most data movement inside nodes while still benefiting from decentralized updates across nodes.
- The reported 5.2-hour training time on 64 GPUs sets a concrete speed benchmark for reaching 7.6 percent WER on the Hub5-2000 SWB test set.
- The method directly enables scaling ASR training to clusters larger than those practical with synchronous SGD.
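The first implication above is simple arithmetic, sketched here in Python. The helper name `rounds_per_epoch` and the dataset size of 768,000 samples are hypothetical, chosen only to make the numbers round; the batch sizes 2560 and 7680 illustrate a 3X ratio and are not claimed to be the paper's exact settings. The sketch also assumes one communication round per global mini-batch.

```python
import math


def rounds_per_epoch(num_samples, global_batch_size):
    """Communication rounds per epoch, assuming one round per global batch."""
    return math.ceil(num_samples / global_batch_size)


# Hypothetical dataset size and batch sizes, for illustration only.
NUM_SAMPLES = 768_000
small_batch_rounds = rounds_per_epoch(NUM_SAMPLES, 2560)   # 300 rounds
large_batch_rounds = rounds_per_epoch(NUM_SAMPLES, 7680)   # 100 rounds
```

Tripling the global batch size cuts the number of rounds per epoch to a third, provided the number of epochs needed to converge does not grow.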
Where Pith is reading between the lines
- The same decentralized batch-size tolerance could be tested on other sequence tasks whose loss surfaces resemble those of deep ASR models.
- Reducing synchronization points may interact with learning-rate schedules or momentum settings in ways not explored here.
- The Ethernet-based 100 Gb/s result suggests the approach remains practical on commodity interconnects rather than requiring specialized fabrics.
Load-bearing premise
The batch-size convergence advantage of ADPSGD over SSGD observed on the SWB-300 and SWB-2000 datasets will hold for the model architectures and optimization settings used.
What would settle it
A side-by-side run on the same SWB-2000 model and hardware where ADPSGD with the 3X batch size diverges or produces substantially higher word error rates than SSGD with its smaller batch size would falsify the core advantage.
Original abstract
Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning to for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with much larger batch size than commonly used Synchronous SGD (SSGD) algorithm. On commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enable training at a much larger scale. Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy ADPSGD algorithm among themselves. On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test-set and a 13.2% WER on the Call-home (CH) test-set in 5.2 hours. To the best of our knowledge, this is the fastest ASR training system that attains this level of model accuracy for SWB-2000 task to be ever reported in the literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) and its hierarchical variant (H-ADPSGD) for distributed ASR training. It claims that ADPSGD converges at batch sizes 3X larger than those used with Synchronous SGD (SSGD) on the public SWB-300 and SWB-2000 datasets, and that H-ADPSGD reaches 7.6% WER on Hub5-2000 SWB and 13.2% WER on CH in 5.2 hours using 64 V100 GPUs over 100Gb/s Ethernet, asserting this is the fastest reported training for this accuracy level on SWB-2000.
Significance. If the empirical claims hold with properly controlled baselines, the results would demonstrate a practical route to larger-batch decentralized training for ASR, enabling faster iteration on large models. The concrete WER numbers on named public datasets provide a clear, falsifiable benchmark that strengthens the contribution relative to purely theoretical distributed-optimization papers.
major comments (2)
- [Abstract] Abstract: the central claim that ADPSGD 'can converge with a batch size 3X as large as the one used in SSGD' is load-bearing for the paper's contribution, yet the text provides no indication that the SSGD large-batch runs applied standard learning-rate scaling (linear or sqrt) or identical optimizer schedules. Without this detail the observed non-convergence of SSGD could be explained by an unscaled LR rather than by the synchronous vs. asynchronous distinction.
- [Abstract] Abstract: the reported WER figures (7.6% SWB, 13.2% CH) and the 'fastest training' assertion are presented without error bars, number of runs, or explicit comparison tables against prior distributed ASR systems on the same datasets and hardware class, making it impossible to assess whether the 5.2-hour result is statistically or practically superior.
minor comments (1)
- [Abstract] Abstract: grammatical error in 'rely on distributed deep learning to for quick training completion'.
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments, which help strengthen the clarity of our claims. We address each major comment below and indicate the revisions we will make.
- Referee: [Abstract] Abstract: the central claim that ADPSGD 'can converge with a batch size 3X as large as the one used in SSGD' is load-bearing for the paper's contribution, yet the text provides no indication that the SSGD large-batch runs applied standard learning-rate scaling (linear or sqrt) or identical optimizer schedules. Without this detail the observed non-convergence of SSGD could be explained by an unscaled LR rather than by the synchronous vs. asynchronous distinction.
  Authors: We agree that explicit confirmation of the learning-rate scaling for the SSGD baseline is necessary to support the central claim. In the experiments reported in the manuscript, linear learning-rate scaling was applied to SSGD (proportional to batch size) along with the same optimizer hyperparameters as ADPSGD. This detail appears in the experimental setup section but was omitted from the abstract. We will revise the abstract to state that standard linear LR scaling was used for the SSGD comparisons and will add a brief clarifying sentence in the methods to ensure the distinction is unambiguous.
  Revision: yes
- Referee: [Abstract] Abstract: the reported WER figures (7.6% SWB, 13.2% CH) and the 'fastest training' assertion are presented without error bars, number of runs, or explicit comparison tables against prior distributed ASR systems on the same datasets and hardware class, making it impossible to assess whether the 5.2-hour result is statistically or practically superior.
  Authors: We acknowledge that the abstract presents the WER numbers and the 'fastest' claim without accompanying statistical context or a compact comparison table. The manuscript body contains a related-work discussion and timing comparisons, but these are not summarized in the abstract and lack explicit error-bar reporting (results are from single runs). We will revise the abstract to note that the reported times and WERs are from single training runs on the specified hardware and will insert a concise comparison table in the results section against prior distributed ASR systems on SWB-2000 using comparable GPU clusters.
  Revision: yes
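The two scaling rules named in the first referee comment can be written down as follows. This is a minimal sketch of the generic linear and square-root heuristics; the function name `scaled_lr` and the example values are hypothetical and are not taken from the paper's experimental setup.

```python
import math


def scaled_lr(base_lr, base_batch, batch, rule="linear"):
    """Scale a learning rate tuned at base_batch to a new batch size."""
    ratio = batch / base_batch
    if rule == "linear":
        # Linear rule: learning rate grows in proportion to the batch size.
        return base_lr * ratio
    if rule == "sqrt":
        # Square-root rule: learning rate grows with sqrt of the batch ratio.
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown scaling rule: {rule!r}")


# Hypothetical example: a 3X larger batch than the tuned baseline.
lr_linear = scaled_lr(0.1, 256, 768, rule="linear")
lr_sqrt = scaled_lr(0.1, 256, 768, rule="sqrt")
```

With a 3X batch, the linear rule triples the learning rate, while the square-root rule multiplies it by about 1.73. A comparison that leaves the rate unscaled would confound the batch-size effect with a learning-rate effect, which is the referee's concern.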
Circularity Check
No circularity: purely empirical claims on public datasets with no derivations or self-referential fits
The paper reports measured WER and training times for ADPSGD vs SSGD on the external public SWB-300/SWB-2000 corpora using 64 V100 GPUs. No equations, ansatzes, fitted parameters renamed as predictions, or uniqueness theorems appear in the abstract or described full text. All central claims are direct experimental outcomes on independent test sets (Hub5-2000 SWB/CH), so the derivation chain is empty and the work is self-contained.
Reference graph
Works this paper leans on
- [1] Introduction. Deep Learning (DL) drives the current Automatic Speech Recognition (ASR) systems and has yielded models of unprecedented accuracy [1, 2]. Stochastic Gradient Descent (SGD) and its variants are the de facto learning algorithms deployed in DL training systems. Distributed Deep Learning (DDL), which deploys different variants of parallel SGD ...
- [2] Background and Problem Formulation. Consider the following stochastic optimization problem $\min_{\theta} F(\theta) = \mathbb{E}_{\xi}[f(\theta;\xi)]$ (1), where $F$ is the objective function, $\theta$ is the parameters to be optimized (it is the weights of networks for DL) and $\xi \sim p(x)$ is a random variable on the training data $x$ obeying distribution $p(x)$. Supposing that there are n training samples an...
- [3] Scalability of ADPSGD. It is well known that when batch size is increased, it is difficult for a DDL system to maintain model accuracy [3, 8]. In our previous work [9], we designed a principled method to increase batch size while maintaining model accuracy with respect to training epochs for both SSGD and ADPSGD on ASR tasks, up to batch size 2560. The key ...
- [4] Design of Hierarchical ADPSGD. In practice, ADPSGD sees a significant accuracy drop when scaling over more than 16 learners on ASR tasks [5, 9] due to the system staleness issue. We built H-ADPSGD, which is a hierarchical system as depicted in Figure 3, to address the staleness issue. N learners construct a super-learner, which applies the weight update r...
- [5] Methodology. 5.1. Software and Hardware. PyTorch 0.4.1 is our DL framework. Our communication library is built with CUDA 9.2 compiler, the CUDA-aware OpenMPI 3.1.1, and g++ 4.8.5 compiler. We run our experiments on a 64-GPU 8-server cluster. Each server has 2 sockets and 9 cores per socket. Each core is an Intel Xeon E5-2697 2.3GHz processor. Each serve...
- [6] Experimental Results. 6.1. Convergence Results. Table 1 records the WER of SWB-2000 models trained by SSGD and ADPSGD under different batch sizes. A single-GPU training baseline is also given as a reference. ADPSGD can converge with a batch size 3x larger than that of SSGD, while maintaining model accuracy. 6.2. Speedup. Figure 4 shows the H-ADPSGD speedup. Us...
- [7] Conclusion and Future Work. In this work, we made the following contributions: (1) We discovered that ADPSGD can scale with much larger batch sizes than the commonly used SSGD algorithm for ASR tasks. To the best of our knowledge, this is the first asynchronous system that scales with larger batch sizes than a synchronous system for public large-scale...
- [8] Related Work. DDL systems enable many AI applications with unprecedented accuracy, such as speech recognition [7, 17], computer vision [3], language modeling [18], and machine translation [19]. During the early days of DDL system research, researchers could only rely on loosely-coupled inexpensive computing systems and adopted PS-based ASGD algorithm [4]...
- [9] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, “English conversational telephone speech recognition by humans and machines,” in Interspeech, 2017.
- [10] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, Dec 2017.
- [11] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training ImageNet in 1 hour,” CoRR, vol. abs/1706.02677, 2017. [Online]. Available: http://arxiv.org/abs/1706.02677
- [12] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in NIPS, 2012.
- [13] X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in ICML, 2018.
- [14] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz, “Revisiting distributed synchronous SGD,” in International Conference on Learning Representations Workshop Track, 2016. [Online]. Available: https://arxiv.org/abs/1604.00981
- [15] D. Amodei et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in ICML’16. PMLR, 2016, pp. 173–182. [Online]. Available: http://proceedings.mlr.press/v48/amodei16.html
- [16] W. Zhang, S. Gupta, and F. Wang, “Model accuracy and runtime tradeoff in distributed deep learning: A systematic study,” in IEEE International Conference on Data Mining, 2016.
- [17] W. Zhang, X. Cui, U. Finkler, B. Kingsbury, G. Saon, D. Kung, and M. Picheny, “Distributed deep learning strategies for automatic speech recognition,” in ICASSP’2019, May 2019.
- [18] W. Zhang, S. Gupta, X. Lian, and J. Liu, “Staleness-aware async-SGD for distributed deep learning,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2016, pp. 2350–2356.
- [19] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” J. Parallel Distrib. Comput., vol. 69, pp. 117–124, 2009.
- [20] Baidu, Effectively Scaling Deep Learning Frameworks, available at https://github.com/baidu-research/baidu-allreduce
- [21]
- [22] M. Cho, U. Finkler, S. Kumar, D. S. Kung, V. Saxena, and D. Sreedhar, “PowerAI DDL,” CoRR, vol. abs/1708.02188, 2017. [Online]. Available: http://arxiv.org/abs/1708.02188
- [23] R. Nair and S. Gupta, “Wildfire: Approximate synchronization of parameters in distributed deep learning,” IBM Journal of Research and Development, vol. 61, no. 4/5, pp. 7:1–7:9, July 2017.
- [24] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in NIPS, 2017.
- [25] K. Chen and Q. Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” in ICASSP’2016, March 2016.
- [26] R. Puri, R. Kirby, N. Yakovenko, and B. Catanzaro, “Large scale language modeling: Converging on 40GB of text in four hours,” CoRR, vol. abs/1808.01371, 2018.
- [27] M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling neural machine translation,” EMNLP 2018 Third Conference on Machine Translation, vol. abs/1806.00187, 2018.
- [28] W. Zhang, M. Feng, Y. Zheng, Y. Ren, Y. Wang, J. Liu, P. Liu, B. Xiang, L. Zhang, B. Zhou, and F. Wang, “Gadei: On scale-up training as a service for deep learning,” The IEEE International Conference on Data Mining series (ICDM’2017), 2017.