pith. machine review for the scientific record.

arxiv: 2006.15704 · v1 · pith:WVMNY27Onew · submitted 2020-06-28 · 💻 cs.DC · cs.LG

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Pith reviewed 2026-05-17 19:09 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords distributed training · data parallelism · PyTorch · deep learning · GPU scaling · all-reduce · gradient communication

The pith

PyTorch's distributed data parallel module achieves near-linear scaling to 256 GPUs by overlapping computation with communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explains the design of PyTorch's built-in support for data-parallel training across multiple GPUs. It covers three practical techniques: grouping gradients into buckets so that fewer, larger messages cross the network; running all-reduce communication in the background while the backward pass is still computing the remaining gradients; and skipping gradient synchronization on iterations that only accumulate gradients locally. Evaluations show that, when configured appropriately, these choices let training throughput scale almost in proportion to the number of GPUs, even at 256 devices. Readers would care because the ability to train large models quickly on big datasets is a core bottleneck in modern deep learning.
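The effect of skipping synchronization can be illustrated with a toy model. The sketch below is not the paper's implementation; it assumes plain Python lists stand in for gradient tensors and a hypothetical `all_reduce_mean` helper stands in for the collective. It shows that accumulating gradients locally over several micro-steps and reducing once gives the same result as reducing after every micro-step, with fewer communication rounds. In PyTorch itself this pattern is exposed through the `no_sync` context manager on `DistributedDataParallel`.

```python
from typing import List, Tuple

Vector = List[float]


def all_reduce_mean(per_worker: List[Vector]) -> Vector:
    """Average one gradient vector across workers (stand-in for all-reduce)."""
    n = len(per_worker)
    return [sum(vals) / n for vals in zip(*per_worker)]


def sync_every_step(grads: List[List[Vector]]) -> Tuple[Vector, int]:
    """grads[step][worker] is a gradient vector; reduce after every micro-step."""
    total = [0.0] * len(grads[0][0])
    rounds = 0
    for step in grads:
        reduced = all_reduce_mean(step)
        rounds += 1
        total = [t + r for t, r in zip(total, reduced)]
    return total, rounds


def sync_last_step_only(grads: List[List[Vector]]) -> Tuple[Vector, int]:
    """Accumulate locally on each worker, then reduce once at the end."""
    workers, dim = len(grads[0]), len(grads[0][0])
    local = [[0.0] * dim for _ in range(workers)]
    for step in grads:
        for w, g in enumerate(step):
            local[w] = [a + b for a, b in zip(local[w], g)]
    return all_reduce_mean(local), 1


# Three micro-steps, two workers, two-element gradients (made-up numbers).
grads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -1.0], [1.5, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
every_total, every_rounds = sync_every_step(grads)
last_total, last_rounds = sync_last_step_only(grads)
```

Both paths end with the same accumulated gradient; only the number of communication rounds differs.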

Core claim

By replicating the model on each worker, computing gradients locally, and then using bucketing plus asynchronous all-reduce to keep replicas identical, the PyTorch distributed data parallel module attains near-linear scalability when configured appropriately on up to 256 GPUs.

What carries the argument

Gradient bucketing combined with overlapping of backward computation and all-reduce communication.
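A minimal sketch of the bucketing idea, assuming a simple greedy packing rule: parameters are visited in the reverse of their registration order (gradients tend to become ready in roughly that order during the backward pass) and packed into buckets up to a byte cap. The function name `assign_buckets`, the parameter names, and the sizes are hypothetical; the real module works on tensors and exposes the cap through the `bucket_cap_mb` argument.

```python
from typing import List, Tuple


def assign_buckets(params: List[Tuple[str, int]], cap_bytes: int) -> List[List[str]]:
    """Greedily pack parameters into buckets, walking the list in reverse.

    params: (name, size_in_bytes) pairs in registration (forward) order.
    A parameter larger than the cap still gets a bucket of its own.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in reversed(params):
        # Close the current bucket if adding this parameter would exceed the cap.
        if current and used + size > cap_bytes:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


# Illustrative sizes only; real layers are far larger.
params = [
    ("conv1.weight", 40),
    ("conv2.weight", 60),
    ("fc.weight", 80),
    ("fc.bias", 10),
]
buckets = assign_buckets(params, cap_bytes=100)
```

With these numbers the last layers share the first bucket, so their all-reduce can start while earlier layers are still computing gradients.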

If this is right

  • Users can train on larger datasets or with larger global batches without a proportional rise in wall-clock time, provided the model still fits on a single device.
  • Existing PyTorch code requires only a small wrapper to obtain these scaling benefits.
  • Clusters of commodity GPUs become practical for research that previously needed specialized hardware.
  • Communication overhead becomes a smaller fraction of total time as model size grows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same overlap and bucketing ideas could be ported to other frameworks that lack native support.
  • At still larger scales the assumption of low network latency would need re-examination.
  • Adding automatic detection of when to skip synchronization could further reduce overhead.

Load-bearing premise

Typical deep learning models contain enough computation per layer for communication to be hidden behind it, and the network supports low-latency collective operations at the tested scale.
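The premise can be made concrete with a toy timeline model. The sketch below assumes buckets become ready one after another during the backward pass and that all-reduce calls run serially on a single communication channel; the function name `iteration_time` and all timings are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def iteration_time(compute_ms: Sequence[float], comm_ms: Sequence[float],
                   overlap: bool = True) -> float:
    """Time until the last bucket's all-reduce finishes.

    compute_ms[i]: backward compute time needed to fill bucket i
    comm_ms[i]:    all-reduce time for bucket i
    """
    if not overlap:
        # All communication happens after the whole backward pass.
        return sum(compute_ms) + sum(comm_ms)
    ready = 0.0      # when the current bucket's gradients are complete
    comm_done = 0.0  # when the communication channel becomes free
    for c, m in zip(compute_ms, comm_ms):
        ready += c
        # A bucket starts reducing once it is ready and the channel is free.
        comm_done = max(comm_done, ready) + m
    return comm_done


# Compute-heavy buckets: only the last all-reduce is exposed.
heavy = iteration_time([10, 10, 10], [4, 4, 4])
heavy_serial = iteration_time([10, 10, 10], [4, 4, 4], overlap=False)

# Communication-heavy buckets: overlap hides very little.
thin = iteration_time([2, 2, 2], [10, 10, 10])
thin_serial = iteration_time([2, 2, 2], [10, 10, 10], overlap=False)
```

In the first case overlap removes most of the communication cost; in the second the channel is the bottleneck and the iteration is dominated by all-reduce time, which is the regime where the premise fails.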

What would settle it

A benchmark run on 256 GPUs showing wall-clock training time that is more than 20 percent worse than the linear projection from a single-GPU baseline.
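One way to turn this criterion into a check, assuming a fixed total amount of work and reading "linear projection" as the single-GPU time divided by the GPU count. The function names and the timings are hypothetical; no measured numbers from the paper are used.

```python
def scaling_efficiency(t1: float, tn: float, n_gpus: int) -> float:
    """Measured speedup divided by the ideal linear speedup."""
    return (t1 / tn) / n_gpus


def within_linear_projection(t1: float, tn: float, n_gpus: int,
                             tolerance: float = 0.20) -> bool:
    """True if tn is at most `tolerance` worse than the projection t1 / n_gpus."""
    projected = t1 / n_gpus
    return tn <= projected * (1 + tolerance)


# Made-up timings in seconds: the linear projection at 256 GPUs is 10 s.
t1 = 2560.0
passes = within_linear_projection(t1, 11.5, 256)
fails = within_linear_projection(t1, 12.5, 256)
efficiency_at_fail = scaling_efficiency(t1, 12.5, 256)
```

A run at 12.5 s corresponds to 80 percent scaling efficiency and would count against the claim under this reading.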

Original abstract

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents the design, implementation, and evaluation of PyTorch's distributed data parallel (DDP) module as of v1.5. It covers optimizations including gradient bucketing, overlapping computation with communication via asynchronous all-reduce, and skipping gradient synchronization. The central empirical claim is that these techniques, when configured appropriately, enable near-linear scalability on up to 256 GPUs for standard vision and language models, namely ResNet50 and BERT.

Significance. If the reported measurements hold, the work offers substantial practical value by documenting engineering choices and their performance impact in a widely adopted framework. The concrete scalability results on 256 GPUs constitute reproducible empirical evidence that can guide practitioners and inform similar systems designs; the absence of parameter fitting or invented axioms keeps the contribution grounded in direct measurement.

major comments (1)
  1. [Evaluation] Evaluation section: the near-linear scalability claim at 256 GPUs rests on the assumption that per-bucket computation time exceeds all-reduce communication time on the target fabric. No sensitivity analysis or results are supplied for thinner models, smaller per-GPU batches, or lower arithmetic-intensity workloads where this overlap breaks, undermining the generality of the central claim.
minor comments (2)
  1. [Abstract] Abstract and Evaluation: benchmark details (exact model sizes, per-GPU batch sizes, and network hardware specifications) should be stated explicitly to allow independent verification of the 256-GPU numbers.
  2. [Implementation] Implementation: the description of how gradient bucketing interacts with the autograd engine could include a small diagram or pseudocode for clarity.
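
One way to picture the interaction named in the second minor comment is the toy reducer below. It is a simulation in plain Python, not the module's code: the class `ReducerSketch` and its method names are hypothetical, a hook call stands in for an autograd hook firing when a gradient is ready, and appending to a list stands in for launching an asynchronous all-reduce. It assumes buckets are reduced in index order so that every process issues the same sequence of collectives.

```python
from typing import Dict, List


class ReducerSketch:
    """Tracks which gradients are ready and launches buckets in index order."""

    def __init__(self, buckets: List[List[str]]) -> None:
        self.buckets = buckets
        self.pending = [len(b) for b in buckets]  # gradients still missing
        self.bucket_of: Dict[str, int] = {
            name: i for i, bucket in enumerate(buckets) for name in bucket
        }
        self.next_to_launch = 0
        self.launched: List[int] = []

    def on_grad_ready(self, name: str) -> None:
        """Called once per parameter, as a gradient hook would be."""
        self.pending[self.bucket_of[name]] -= 1
        # Launch every complete bucket, but never out of index order.
        while (self.next_to_launch < len(self.buckets)
               and self.pending[self.next_to_launch] == 0):
            self.launched.append(self.next_to_launch)  # stand-in for all-reduce
            self.next_to_launch += 1


buckets = [["fc.bias", "fc.weight"], ["conv2.weight", "conv1.weight"]]

# Bucket 1 completes first here, yet it waits until bucket 0 has launched.
reducer = ReducerSketch(buckets)
trace = []
for name in ["conv2.weight", "conv1.weight", "fc.weight", "fc.bias"]:
    reducer.on_grad_ready(name)
    trace.append(list(reducer.launched))
```

The trace shows nothing launching until the first bucket is complete, after which both buckets go out back to back.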

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and the positive overall assessment of the work. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the near-linear scalability claim at 256 GPUs rests on the assumption that per-bucket computation time exceeds all-reduce communication time on the target fabric. No sensitivity analysis or results are supplied for thinner models, smaller per-GPU batches, or lower arithmetic-intensity workloads where this overlap breaks, undermining the generality of the central claim.

    Authors: We agree that the reported near-linear scaling results rely on workloads where per-bucket computation time is long enough to hide communication. The manuscript evaluates the techniques on standard models (ResNet50 and BERT) with batch sizes typical for those models on the target 256-GPU cluster, as these represent the practical use cases motivating the optimizations. The abstract and evaluation section qualify the claim with the phrase 'when configured appropriately.' We do not claim universality across all possible models and batch sizes. In a revised version we will add a short paragraph in the evaluation section explicitly noting the workload dependence of the overlap benefit and stating that for thinner models or very small per-GPU batches the same configuration may yield sub-linear scaling, thereby clarifying the scope of the central claim without requiring new experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical systems paper with claims resting on direct measurements

full rationale

The paper describes the design and implementation of PyTorch's distributed data parallel module, including techniques such as gradient bucketing, overlapping computation with communication, and skipping synchronization. Its central claim of near-linear scalability on 256 GPUs is presented as the outcome of empirical evaluations on standard models rather than any derivation, equation, or fitted parameter. No mathematical predictions, self-definitional constructs, uniqueness theorems, or self-citation chains appear in the load-bearing steps. The work is self-contained as an engineering report whose results are directly falsifiable via reproduction on the reported hardware and workloads.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard distributed deep-learning assumptions rather than new axioms or invented entities. No free parameters are introduced to fit results; the work is an engineering report of existing techniques applied inside PyTorch.

axioms (2)
  • domain assumption Data parallelism replicates the full model on each device and averages gradients after each iteration to keep replicas consistent.
    Invoked in the abstract description of the basic technique.
  • domain assumption Gradient communication can be overlapped with subsequent layer computation when the network and model structure permit.
    Central to the overlapping optimization described.
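
The first assumption can be checked on a toy problem. The sketch below assumes a one-parameter linear model with a mean-squared-error loss and two equally sized shards; the helper `grad_mse` and the data are hypothetical. It shows that averaging the per-shard gradients reproduces the gradient of the full batch, which is what keeps replicas consistent when each applies the same update.

```python
from typing import List, Tuple


def grad_mse(w: float, batch: List[Tuple[float, float]]) -> float:
    """Derivative with respect to w of mean((w * x - y) ** 2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up (x, y) pairs, split evenly across two simulated workers.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5
shards = [data[:2], data[2:]]

local_grads = [grad_mse(w, shard) for shard in shards]
averaged = sum(local_grads) / len(local_grads)  # what all-reduce + divide yields
full_batch = grad_mse(w, data)                  # what one large local batch yields
```

The equality relies on the shards having the same size; with unequal shards a plain average would weight samples unevenly.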

pith-pipeline@v0.9.0 · 5513 in / 1362 out tokens · 73462 ms · 2026-05-17T19:09:51.262341+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

    cs.LG 2026-04 unverdicted novelty 7.0

    ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...

  2. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    cs.CV 2023-03 conditional novelty 7.0

    BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumo...

  3. Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

    cs.LG 2026-05 unverdicted novelty 6.0

    Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...

  4. ShardTensor: Domain Parallelism for Scientific Machine Learning

    cs.DC 2026-05 unverdicted novelty 6.0

    ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

  5. Accelerating Compound LLM Training Workloads with Maestro

    cs.DC 2026-05 unverdicted novelty 6.0

    Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.

  6. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  7. ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  8. CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

    cs.LG 2026-04 unverdicted novelty 6.0

    CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

  9. Training Time Prediction for Mixed Precision-based Distributed Training

    cs.LG 2026-04 unverdicted novelty 6.0

    A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.

  10. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  11. veScale-FSDP: Flexible and High-Performance FSDP at Scale

    cs.DC 2026-02 unverdicted novelty 6.0

    veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at...

  12. Gradient-descent methods for scalable quantum detector tomography

    quant-ph 2025-11 conditional novelty 6.0

    Gradient descent optimization reconstructs POVMs for phase-insensitive quantum detectors with higher or comparable fidelity to constrained convex optimization but in much less time.

  13. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    cs.DC 2023-04 unverdicted novelty 6.0

    PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.

  14. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  15. Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum

    cs.DC 2026-05 unverdicted novelty 5.0

    An adaptive DNN partitioning framework for heterogeneous edge-cloud systems reduces energy consumption by 27-36% and end-to-end latency by 6-23% versus static baselines on real hardware with VGG16, AlexNet, and MobileNetV2.

  16. Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

    math.OC 2026-05 unverdicted novelty 5.0

    Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

  17. CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

    cs.DC 2026-05 unverdicted novelty 4.0

    CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over ...

  18. Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

    cs.DC 2025-11 unverdicted novelty 4.0

    Thermal imbalance in multi-GPU nodes creates hotter straggler GPUs that slow down cooler leader GPUs during overlapped computation and communication in LLM training.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 18 Pith papers · 6 internal anchors

  1. [1]

    PyTorch Distributed: Experiences on Accelerating Data Parallel Training

    INTRODUCTION Deep Neural Networks (DNN) have powered a wide spectrum of applications, ranging from image recognition [20], language translation [15], anomaly detection [16], content recommendation [38], to drug discovery [33], art generation [28], game play [18], and self-driving cars [13]. Many applications pursue higher intelligence by optimizing la...

  2. [2]

    Then, we explain and justify the idea of data parallelism and describe communication primitives

    BACKGROUND Before diving into distributed training, let us briefly discuss the implementation and execution of local model training using PyTorch. Then, we explain and justify the idea of data parallelism and describe communication primitives. 2.1 PyTorch PyTorch organizes values into Tensors which are generic n-dimensional arrays with a rich set of da...

  3. [3]

    During distributed training, each process has its own local model replica and local optimizer

    SYSTEM DESIGN PyTorch [30] provides a DistributedDataParallel (DDP) module to help easily parallelize training across multiple processes and machines. During distributed training, each process has its own local model replica and local optimizer. In terms of correctness, distributed data parallel training and local training must be mathematically equiv...

  4. [4]

    This section focus on the current status as of PyTorch v1.5.0

    IMPLEMENTATION The implementation of DDP has evolved several times in the past few releases. This section focus on the current status as of PyTorch v1.5.0. DDP implementation lives both in Python and C++ files, with Python exposing the API and composing non-performance-critical components, and C++ serving the core gradient reduction algorithm. The Python ...

  5. [5]

    In the exclusive cluster, the GPUs are located on 4 servers, connected using Mellanox MT27700 ConnectX-4 100GB/s NIC

    EVALUATION This section presents the evaluation results of PyTorch DDP using an exclusive 32 GPU cluster and a shared entitlement. In the exclusive cluster, the GPUs are located on 4 servers, connected using Mellanox MT27700 ConnectX-4 100GB/s NIC. All 4 servers reside in the same rack, and each server is equipped with 8 NVIDIA Tesla V100 GPUs. Fig. 5...

  6. [6]

    We then present several ideas for future improvements

    DISCUSSION This section discusses lessons learned from our experiments and past experiences. We then present several ideas for future improvements. 6.1 Lessons Learned Distributed data parallel training is a conceptually simple but practically subtle framework. There are various techniques to improve its speed, creating a complex configuration space...

  7. [7]

    Below are three popular categorizations

    RELATED WORK Distributed training algorithms can be categorized into different types from different perspectives. Below are three popular categorizations. • Synchronous update vs Asynchronous update: With the former, all model replicas can use AllReduce to collectively communicate gradients or parameters, while the asynchronous scheme employs P2P communic...

  8. [8]

    DDP accelerates training by aggregating gradients into buckets for communication, overlapping communication with computation, and skipping synchronizations

    CONCLUSION This paper explained the design and implementation of the distributed data parallel module in PyTorch v1.5, and conducted performance evaluations on NCCL and Gloo backend using ResNet50 and BERT models. DDP accelerates training by aggregating gradients into buckets for communication, overlapping communication with computation, and skipping ...

  9. [9]

    https://github.com/facebookincubator/gloo, 2019

    Gloo: a collective communications library. https://github.com/facebookincubator/gloo, 2019

  10. [10]

    https://developer.nvidia.com/nccl, 2019

    NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl, 2019

  11. [11]

    https://www.nvidia.com/en-us/data-center/nvlink/, 2019

    NVLINK AND NVSWITCH: The Building Blocks of Advanced Multi-GPU Communication. https://www.nvidia.com/en-us/data-center/nvlink/, 2019

  12. [12]

    https://www.open-mpi.org/, 2019

    Open MPI: A High Performance Message Passing Library. https://www.open-mpi.org/, 2019

  13. [13]

    https://pybind11.readthedocs.io/, 2019

    Pybind11: Seamless operability between C++11 and Python. https://pybind11.readthedocs.io/, 2019

  14. [14]

    https://pytorch.org/docs/master/rpc.html, 2019

    PyTorch Distributed RPC Framework. https://pytorch.org/docs/master/rpc.html, 2019

  15. [15]

    https://pytorch.org/docs/stable/nn.html#torch

    PyTorch Module forward Function. https://pytorch.org/docs/stable/nn.html#torch.nn.Module.forward, 2019

  16. [16]

    https://docs.scipy.org/, 2019

    SciPy: open-source software for mathematics, science, and engineering. https://docs.scipy.org/, 2019

  17. [17]

    https://pytorch.org/docs/stable/nn.html#torch

    PyTorch DistributedDataParallel. https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel, 2020

  18. [18]

    https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy, 2020

    TensorFlow Distributed Training MultiWorkerMirroredStrategy. https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy, 2020

  19. [19]

    https://www.tensorflow.org/guide/distributed_training#parameterserverstrategy, 2020

    TensorFlow Distributed Training ParameterServerStrategy. https://www.tensorflow.org/guide/distributed_training#parameterserverstrategy, 2020

  20. [20]

    Y. Bao, Y. Peng, Y. Chen, and C. Wu. Preemptive all-reduce scheduling for expediting distributed dnn training. In IEEE INFOCOM, 2020

  21. [21]

    End to End Learning for Self-Driving Cars

    M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 , 2016

  22. [22]

    M. Cho, U. Finkler, M. Serrano, D. Kung, and H. Hunter. Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy. IBM Journal of Research and Development , 63(6):1–1, 2019

  23. [23]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  24. [24]

    M. Du, F. Li, G. Zheng, and V. Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298, 2017

  25. [25]

    A. Fan, E. Grave, and A. Joulin. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556 , 2019

  26. [26]

    X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems , pages 3338–3346, 2014

  27. [27]

    S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. Tictac: Accelerating distributed deep learning with communication scheduling. arXiv preprint arXiv:1803.03288, 2018

  28. [28]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  29. [29]

    Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pages 103–112, 2019

  30. [30]

    S. Jeaugey. Massively Scale Your Deep Learning Training with NCCL 2.4. https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/, February 2019

  31. [31]

    S. Kim, G.-I. Yu, H. Park, S. Cho, E. Jeong, H. Ha, S. Lee, J. S. Jeong, and B.-G. Chun. Parallax: Sparsity-aware data parallel training of deep neural networks. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–15, 2019

  32. [32]

    J. Kosaian, K. V. Rashmi, and S. Venkataraman. Parity models: Erasure-coded resilience for prediction serving systems. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 30–46, New York, NY, USA, 2019. Association for Computing Machinery

  33. [33]

    Y. LeCun, C. Cortes, and C. Burges. The MNIST Database. http://yann.lecun.com/exdb/mnist/, 1999

  34. [34]

    Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski. A theoretical framework for back-propagation. In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988

  35. [35]

    M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14), pages 583–598, 2014

  36. [36]

    H. Mao, M. Cheung, and J. She. Deepart: Learning joint representations of visual arts. In Proceedings of the 25th ACM international conference on Multimedia , pages 1183–1191, 2017

  37. [37]

    D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

  38. [38]

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Sys...

  39. [39]

    Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, C. Wu, and C. Guo. A generic communication scheduler for distributed dnn training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 16–29, 2019

  40. [40]

    ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054 , 2019

  41. [41]

    B. Ramsundar, P. Eastman, P. Walters, and V. Pande. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O’Reilly Media, Inc., 2019

  42. [42]

    F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014

  43. [43]

    Horovod: fast and easy distributed deep learning in TensorFlow

    A. Sergeev and M. D. Balso. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018

  44. [44]

    N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018

  45. [45]

    P. Sun, Y. Wen, R. Han, W. Feng, and S. Yan. Gradientflow: Optimizing network performance for large-scale distributed dnn training. IEEE Transactions on Big Data , 2019

  46. [46]

    A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Advances in neural information processing systems , pages 2643–2651, 2013

  47. [47]

    G. Wang, S. Venkataraman, A. Phanishayee, J. Thelin, N. Devanur, and I. Stoica. Blink: Fast and generic collectives for distributed ml. arXiv preprint arXiv:1910.04940, 2019

  48. [48]

    J. Wang, V. Tantia, N. Ballas, and M. Rabbat. Slowmo: Improving communication-efficient distributed sgd with slow momentum. arXiv preprint arXiv:1910.00643, 2019