
arxiv: 1907.05701 · v1 · pith:RA5HLQ62new · submitted 2019-07-10 · eess.AS · cs.DC · cs.LG · cs.SD · stat.ML

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

Pith reviewed 2026-05-24 23:16 UTC · model grok-4.3

keywords automatic speech recognition · distributed deep learning · ADPSGD · batch size · word error rate · hierarchical training · GPU cluster · synchronous SGD

The pith

ADPSGD converges with a batch size three times as large as the one synchronous SGD can use for ASR training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern ASR systems need distributed deep learning to finish training quickly, but convergence fails when mini-batch sizes grow too large. The paper reports that Asynchronous Decentralized Parallel Stochastic Gradient Descent tolerates batch sizes three times those usable by standard synchronous SGD on the SWB-300 and SWB-2000 datasets. The authors then build a hierarchical version that groups same-node learners into super learners with fast allreduce and runs the decentralized algorithm across those super learners. On a 64-GPU cluster this system reaches 7.6 percent WER on the Hub5-2000 Switchboard test set and 13.2 percent WER on Call-home in 5.2 hours.

Core claim

Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can converge with a batch size 3X as large as the one used in Synchronous SGD on the SWB-300 and SWB-2000 ASR datasets. A Hierarchical-ADPSGD system lets learners on the same node form a super learner via fast allreduce while super learners communicate with ADPSGD. This reaches 7.6 percent WER on Hub5-2000 SWB and 13.2 percent WER on CH in 5.2 hours on 64 V100 GPUs connected by 100 Gb/s Ethernet.

What carries the argument

Hierarchical-ADPSGD (H-ADPSGD), which forms intra-node super learners with allreduce and runs ADPSGD among the super learners to support larger batches at scale.
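
The two-level structure described here can be illustrated with a toy, single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the quadratic loss, the ring topology, the sequential "one super learner acts at a time" stand-in for asynchrony, and every function and parameter name (`simulate_h_adpsgd`, `gpus_per_node`, and so on) are hypothetical. The only elements taken from the paper's description are that learners on one node average their gradients (the allreduce step) and that super learners then average weights with a neighbor in the ADPSGD style.

```python
import random

def noisy_grad(theta, target, rng):
    # Gradient of the toy loss 0.5 * ||theta - target||^2 plus Gaussian noise
    # (assumption: stands in for one learner's mini-batch gradient).
    return [t - c + rng.gauss(0.0, 0.1) for t, c in zip(theta, target)]

def super_learner_grad(theta, target, rng, gpus_per_node):
    # Stand-in for the intra-node allreduce: average the local learners' gradients.
    grads = [noisy_grad(theta, target, rng) for _ in range(gpus_per_node)]
    return [sum(col) / gpus_per_node for col in zip(*grads)]

def simulate_h_adpsgd(num_nodes=8, gpus_per_node=8, dim=4,
                      steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    target = [1.0] * dim
    # One weight vector per super learner (node), randomly initialized.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_nodes)]
    for _ in range(steps):
        # Asynchrony is modeled by letting one randomly chosen super learner act.
        i = rng.randrange(num_nodes)
        grad = super_learner_grad(weights[i], target, rng, gpus_per_node)
        # Pick a random ring neighbor and average weights with it.
        j = (i + rng.choice((-1, 1))) % num_nodes
        avg = [(a + b) / 2 for a, b in zip(weights[i], weights[j])]
        weights[j] = avg
        # Apply the gradient computed at the pre-averaging weights.
        weights[i] = [a - lr * g for a, g in zip(avg, grad)]
    return weights, target
```

The default of 8 nodes with 8 learners each mirrors the 64-GPU, 8-server cluster mentioned in the paper; everything else is chosen only to keep the toy problem small. No global barrier appears anywhere in the loop: each update touches only one super learner and one neighbor.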

If this is right

  • Larger batch sizes reduce the total number of communication rounds required during distributed training.
  • The hierarchical design exploits fast local networks to keep most data movement inside nodes while still benefiting from decentralized updates across nodes.
  • The reported 5.2-hour training time on 64 GPUs sets a concrete speed benchmark for reaching 7.6 percent WER on the Hub5-2000 SWB test set.
  • The method directly enables scaling ASR training to clusters larger than those practical with synchronous SGD.
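
The first point above is a counting argument, which a few lines of arithmetic make concrete. This is a sketch only: the sample count is a made-up illustrative number, not a dataset size from the paper, and the function name `updates_per_epoch` is hypothetical. It also assumes one communication round per weight update, which ignores the difference in cost between an allreduce and a pairwise ADPSGD exchange.

```python
import math

def updates_per_epoch(num_samples, global_batch_size):
    # Number of weight updates (and, under the stated assumption,
    # communication rounds) needed to pass over the data once.
    return math.ceil(num_samples / global_batch_size)

# Hypothetical corpus size, chosen only for illustration.
NUM_SAMPLES = 1_000_000

small = updates_per_epoch(NUM_SAMPLES, 2560)      # baseline batch size
large = updates_per_epoch(NUM_SAMPLES, 3 * 2560)  # batch size 3X as large
```

With these illustrative numbers, tripling the batch size cuts the rounds per epoch from 391 to 131, roughly a threefold reduction.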

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decentralized batch-size tolerance could be tested on other sequence tasks whose loss surfaces resemble those of deep ASR models.
  • Reducing synchronization points may interact with learning-rate schedules or momentum settings in ways not explored here.
  • The Ethernet-based 100 Gb/s result suggests the approach remains practical on commodity interconnects rather than requiring specialized fabrics.

Load-bearing premise

The batch-size convergence advantage of ADPSGD over SSGD observed on the SWB-300 and SWB-2000 datasets is genuine for the model architectures and optimization settings used, rather than an artifact of how the SSGD baseline was tuned.

What would settle it

A side-by-side run on the same SWB-2000 model and hardware where ADPSGD with the 3X batch size diverges or produces substantially higher word error rates than SSGD with its smaller batch size would falsify the core advantage.

Original abstract

Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning to for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with much larger batch size than commonly used Synchronous SGD (SSGD) algorithm. On commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enable training at a much larger scale. Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy ADPSGD algorithm among themselves. On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test-set and a 13.2% WER on the Call-home (CH) test-set in 5.2 hours. To the best of our knowledge, this is the fastest ASR training system that attains this level of model accuracy for SWB-2000 task to be ever reported in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) and its hierarchical variant (H-ADPSGD) for distributed ASR training. It claims that ADPSGD converges at batch sizes 3X larger than those used with Synchronous SGD (SSGD) on the public SWB-300 and SWB-2000 datasets, and that H-ADPSGD reaches 7.6% WER on Hub5-2000 SWB and 13.2% WER on CH in 5.2 hours using 64 V100 GPUs over 100Gb/s Ethernet, asserting this is the fastest reported training for this accuracy level on SWB-2000.

Significance. If the empirical claims hold with properly controlled baselines, the results would demonstrate a practical route to larger-batch decentralized training for ASR, enabling faster iteration on large models. The concrete WER numbers on named public datasets provide a clear, falsifiable benchmark that strengthens the contribution relative to purely theoretical distributed-optimization papers.

major comments (2)
  1. [Abstract] The central claim that ADPSGD 'can converge with a batch size 3X as large as the one used in SSGD' is load-bearing for the paper's contribution, yet the text gives no indication that the SSGD large-batch runs applied standard learning-rate scaling (linear or square-root) or identical optimizer schedules. Without this detail, the observed non-convergence of SSGD could be explained by an unscaled learning rate rather than by the synchronous vs. asynchronous distinction.
  2. [Abstract] The reported WER figures (7.6% SWB, 13.2% CH) and the 'fastest training' assertion are presented without error bars, number of runs, or explicit comparison tables against prior distributed ASR systems on the same datasets and hardware class, making it impossible to assess whether the 5.2-hour result is statistically or practically superior.
minor comments (1)
  1. [Abstract] Grammatical error in 'rely on distributed deep learning to for quick training completion'.
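
The linear and square-root learning-rate scaling rules named in major comment 1 are common heuristics for large-batch training. A minimal sketch of both follows; the function name `scale_lr` and the example numbers are hypothetical, and nothing here is taken from the paper's actual hyperparameters.

```python
import math

def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    # Scale a learning rate tuned at base_batch for use at new_batch.
    ratio = new_batch / base_batch
    if rule == "linear":
        # Linear rule: learning rate grows in proportion to batch size.
        return base_lr * ratio
    if rule == "sqrt":
        # Square-root rule: learning rate grows with the square root of the ratio.
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown scaling rule: {rule!r}")
```

For a batch size 3X as large, the linear rule triples the learning rate while the square-root rule multiplies it by about 1.73. A baseline left at the unscaled rate would be a weaker comparison point, which is the concern the referee raises.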

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help strengthen the clarity of our claims. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The central claim that ADPSGD 'can converge with a batch size 3X as large as the one used in SSGD' is load-bearing for the paper's contribution, yet the text gives no indication that the SSGD large-batch runs applied standard learning-rate scaling (linear or square-root) or identical optimizer schedules. Without this detail, the observed non-convergence of SSGD could be explained by an unscaled learning rate rather than by the synchronous vs. asynchronous distinction.

    Authors: We agree that explicit confirmation of the learning-rate scaling for the SSGD baseline is necessary to support the central claim. In the experiments reported in the manuscript, linear learning-rate scaling was applied to SSGD (proportional to batch size) along with the same optimizer hyperparameters as ADPSGD. This detail appears in the experimental setup section but was omitted from the abstract. We will revise the abstract to state that standard linear learning-rate scaling was used for the SSGD comparisons and will add a brief clarifying sentence in the methods to ensure the distinction is unambiguous.

    Revision: yes

  2. Referee: [Abstract] The reported WER figures (7.6% SWB, 13.2% CH) and the 'fastest training' assertion are presented without error bars, number of runs, or explicit comparison tables against prior distributed ASR systems on the same datasets and hardware class, making it impossible to assess whether the 5.2-hour result is statistically or practically superior.

    Authors: We acknowledge that the abstract presents the WER numbers and the 'fastest' claim without accompanying statistical context or a compact comparison table. The manuscript body contains a related-work discussion and timing comparisons, but these are not summarized in the abstract and lack explicit error-bar reporting (results are from single runs). We will revise the abstract to note that the reported times and WERs are from single training runs on the specified hardware and will insert a concise comparison table in the results section against prior distributed ASR systems on SWB-2000 using comparable GPU clusters.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims on public datasets with no derivations or self-referential fits

Full rationale

The paper reports measured WER and training times for ADPSGD vs SSGD on the external public SWB-300/SWB-2000 corpora using 64 V100 GPUs. No equations, ansatzes, fitted parameters renamed as predictions, or uniqueness theorems appear in the abstract or described full text. All central claims are direct experimental outcomes on independent test sets (Hub5-2000 SWB/CH), so the derivation chain is empty and the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work is purely empirical performance reporting.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 6 internal anchors

  1. [1]

    super-learner

    Introduction Deep Learning (DL) drives the current Automatic Speech Recognition (ASR) systems and has yielded models of unprecedented accuracy [1, 2]. Stochastic Gradient Descent (SGD) and its variants are the de facto learning algorithms deployed in DL training systems. Distributed Deep Learning (DDL), which deploys different variants of parallel SGD ...

  2. [2]

    A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

    Background and Problem Formulation Consider the following stochastic optimization problem: $\min_{\theta} F(\theta) = \mathbb{E}_{\xi}[f(\theta;\xi)]$ (1), where $F$ is the objective function, $\theta$ is the parameters to be optimized (it is the weights of networks for DL) and $\xi \sim p(x)$ is a random variable on the training data $x$ obeying distribution $p(x)$. Supposing that there are n training samples an...

  3. [3]

    Scalability of ADPSGD It is well known that when batch-size is increased, it is difficult for a DDL system to maintain model accuracy [3, 8]. In our previous work [9], we designed a principled method to increase batch size while maintaining model accuracy with respect to training epochs for both SSGD and ADPSGD on ASR tasks, up to batch size 2560. The key ...

  4. [4]

    We built H-ADPSGD, which is a hierarchical system as depicted in Figure 3, to address the staleness issue

    Design of Hierarchical ADPSGD In practice, ADPSGD sees a significant accuracy drop when scaling over more than 16 learners on ASR tasks [5, 9] due to the system staleness issue. We built H-ADPSGD, which is a hierarchical system as depicted in Figure 3, to address the staleness issue. N learners construct a super-learner, which applies the weight update r...

  5. [5]

    Software and Hardware PyTorch 0.4.1 is our DL framework

    Methodology 5.1. Software and Hardware PyTorch 0.4.1 is our DL framework. Our communication library is built with CUDA 9.2 compiler, the CUDA-aware OpenMPI 3.1.1, and g++ 4.8.5 compiler. We run our experiments on a 64-GPU 8-server cluster. Each server has 2 sockets and 9 cores per socket. Each core is an Intel Xeon E5-2697 2.3GHz processor. Each serve...

  6. [6]

    Convergence Results Table 1 records the WER of SWB-2000 models trained by SSGD and ADPSGD under different batch sizes

    Experimental Results 6.1. Convergence Results Table 1 records the WER of SWB-2000 models trained by SSGD and ADPSGD under different batch sizes. The single-GPU training baseline is also given as a reference. ADPSGD can converge with a batch size 3x larger than that of SSGD, while maintaining model accuracy. 6.2. Speedup Figure 4 shows the H-ADPSGD speedup. Us...

  7. [7]

    To the best of our knowledge, this is the first asynchronous system that scales with larger batch sizes than a synchronous system for public large-scale DL tasks

    Conclusion and Future Work In this work, we made the following contributions: (1) We discovered that ADPSGD can scale with much larger batch sizes than the commonly used SSGD algorithm for ASR tasks. To the best of our knowledge, this is the first asynchronous system that scales with larger batch sizes than a synchronous system for public large-scale...

  8. [8]

    During the early days of DDL system research, researchers could only rely on loosely-coupled inexpensive computing systems and adopted PS-based ASGD algorithm [4]

    Related Work DDL systems enable many AI applications with unprecedented accuracy, such as speech recognition [7, 17], computer vision [3], language modeling [18], and machine translation [19]. During the early days of DDL system research, researchers could only rely on loosely-coupled inexpensive computing systems and adopted PS-based ASGD algorithm [4]...

  9. [9]

    English conversational telephone speech recognition by humans and machines,

    G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, “English conversational telephone speech recognition by humans and machines,” in Interspeech, 2017

  10. [10]

    Toward human parity in conversational speech recognition,

    W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, Dec 2017

  11. [11]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training imagenet in 1 hour,” CoRR, vol. abs/1706.02677, 2017. [Online]. Available: http://arxiv.org/abs/1706.02677

  12. [12]

    Large scale distributed deep networks,

    J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in NIPS, 2012

  13. [13]

    Asynchronous decentralized parallel stochastic gradient descent,

    X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in ICML, 2018

  14. [14]

    Revisiting Distributed Synchronous SGD

    J. Chen, R. Monga, S. Bengio, and R. Jozefowicz, “Revisiting distributed synchronous sgd,” in International Conference on Learning Representations Workshop Track, 2016. [Online]. Available: https://arxiv.org/abs/1604.00981

  15. [15]

    Deep speech 2: End-to-end speech recognition in english and mandarin,

    D. Amodei et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in ICML’16. PMLR, 2016, pp. 173–182. [Online]. Available: http://proceedings.mlr.press/v48/amodei16.html

  16. [16]

    Model accuracy and runtime tradeoff in distributed deep learning: A systematic study,

    W. Zhang, S. Gupta, and F. Wang, “Model accuracy and runtime tradeoff in distributed deep learning: A systematic study,” in IEEE International Conference on Data Mining, 2016

  17. [17]

    Distributed deep learning strategies for automatic speech recognition,

    W. Zhang, X. Cui, U. Finkler, B. Kingsbury, G. Saon, D. Kung, and M. Picheny, “Distributed deep learning strategies for automatic speech recognition,” in ICASSP’2019, May 2019

  18. [18]

    Staleness-aware async-sgd for distributed deep learning,

    W. Zhang, S. Gupta, X. Lian, and J. Liu, “Staleness-aware async-sgd for distributed deep learning,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2016, pp. 2350–2356

  19. [19]

    Bandwidth optimal all-reduce algorithms for clusters of workstations,

    P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” J. Parallel Distrib. Comput., vol. 69, pp. 117–124, 2009

  20. [20]

    Baidu, Effectively Scaling Deep Learning Frameworks, available at https://github.com/baidu-research/baidu-allreduce

  21. [21]

    Nvidia, NCCL: Optimized primitives for collective multi-GPU communication. [Online]. Available: https://github.com/NVIDIA/nccl

  22. [22]

    PowerAI DDL

    M. Cho, U. Finkler, S. Kumar, D. S. Kung, V. Saxena, and D. Sreedhar, “Powerai DDL,” CoRR, vol. abs/1708.02188, 2017. [Online]. Available: http://arxiv.org/abs/1708.02188

  23. [23]

    Wildfire: Approximate synchronization of parameters in distributed deep learning,

    R. Nair and S. Gupta, “Wildfire: Approximate synchronization of parameters in distributed deep learning,” IBM Journal of Research and Development, vol. 61, no. 4/5, pp. 7:1–7:9, July 2017

  24. [24]

    Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,

    X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in NIPS, 2017

  25. [25]

    Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,

    K. Chen and Q. Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” in ICASSP’2016, March 2016

  26. [26]

    Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

    R. Puri, R. Kirby, N. Yakovenko, and B. Catanzaro, “Large scale language modeling: Converging on 40gb of text in four hours,” CoRR, vol. abs/1808.01371, 2018

  27. [27]

    Scaling Neural Machine Translation

    M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling neural machine translation,” EMNLP 2018 THIRD CONFERENCE ON MACHINE TRANSLATION, vol. abs/1806.00187, 2018

  28. [28]

    Gadei: On scale-up training as a service for deep learning

    W. Zhang, M. Feng, Y. Zheng, Y. Ren, Y. Wang, J. Liu, P. Liu, B. Xiang, L. Zhang, B. Zhou, and F. Wang, “Gadei: On scale-up training as a service for deep learning.” The IEEE International Conference on Data Mining series (ICDM’2017), 2017