PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Adam Paszke; Brian Vaughan; Jeff Smith; Omkar Salpekar; Pieter Noordhuis; Pritam Damania; Rohan Varma; Shen Li; Soumith Chintala; Teng Li; Yanli Zhao


Reviewed by Pith at T0; open to challenge.


T0 review · grok-4.3

PyTorch's distributed data parallel module achieves near-linear scaling to 256 GPUs by overlapping computation with communication.

2026-05-17 19:09 UTC pith:WVMNY27O


arxiv 2006.15704 v1 pith:WVMNY27O submitted 2020-06-28 cs.DC cs.LG


Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala


classification cs.DC cs.LG

keywords distributed training, data parallelism, PyTorch, deep learning, GPU scaling, all-reduce, gradient communication

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explains the design of PyTorch's built-in support for data-parallel training across multiple GPUs. It covers practical techniques including grouping gradients into buckets so that fewer network messages are sent, running communication in the background while the backward pass is still computing, and skipping synchronization on iterations that only accumulate gradients locally. Evaluations show that, with these choices, training scales almost linearly with the number of GPUs, even at 256 devices. Readers would care because the ability to train large models quickly on big datasets is a core bottleneck in modern deep learning.

Core claim

By replicating the model on each worker, computing gradients locally, and then using bucketing plus asynchronous all-reduce to keep replicas identical, the PyTorch distributed data parallel module attains near-linear scalability when configured appropriately on up to 256 GPUs.

What carries the argument

Gradient bucketing combined with overlapping of backward computation and all-reduce communication.
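
To make the bucketing idea concrete, here is a minimal sketch in plain Python. It is not PyTorch's implementation. It assumes gradients become ready roughly in the reverse of parameter order and packs them greedily into buckets under a byte cap; the function name `assign_buckets`, the gradient sizes, and the cap are hypothetical values chosen for illustration.

```python
from typing import List


def assign_buckets(param_bytes: List[int], cap_bytes: int) -> List[List[int]]:
    """Greedily pack parameter indices into buckets of at most cap_bytes.

    Parameters are visited in reverse order, on the assumption that the
    backward pass produces gradients for the last layers first.
    A parameter larger than the cap ends up in a bucket of its own.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx in reversed(range(len(param_bytes))):
        size = param_bytes[idx]
        # Close the current bucket if adding this gradient would exceed the cap.
        if current and current_size + size > cap_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical per-parameter gradient sizes in bytes, first layer first.
sizes = [400, 100, 300, 200, 50, 250]
buckets = assign_buckets(sizes, cap_bytes=500)
for bucket in buckets:
    print(bucket, sum(sizes[i] for i in bucket))
```

In this toy case six gradients collapse into three buckets, so three collective calls replace six. A larger cap means fewer calls but a longer wait before the first one can start, which is the trade-off the bucket size controls.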

Load-bearing premise

Typical deep learning models contain enough computation per layer for communication to be hidden behind it, and the network supports low-latency collective operations at the tested scale.

What would settle it

A benchmark run on 256 GPUs showing wall-clock training time that is more than 20 percent worse than the linear projection from a single-GPU baseline.
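
The settling criterion is plain arithmetic, sketched below. This assumes "linear projection" means the single-GPU time for a fixed amount of work divided by the number of GPUs; the function names and all timings are hypothetical, not measurements from the paper.

```python
def linear_projection(single_gpu_seconds: float, num_gpus: int) -> float:
    """Ideal time for the same work if scaling were perfectly linear."""
    return single_gpu_seconds / num_gpus


def slowdown_vs_linear(single_gpu_seconds: float, num_gpus: int,
                       measured_seconds: float) -> float:
    """Fraction by which the measured time exceeds the linear projection."""
    ideal = linear_projection(single_gpu_seconds, num_gpus)
    return measured_seconds / ideal - 1.0


def refutes_near_linear(single_gpu_seconds: float, num_gpus: int,
                        measured_seconds: float, threshold: float = 0.20) -> bool:
    """True if the run is more than `threshold` worse than linear scaling."""
    return slowdown_vs_linear(single_gpu_seconds, num_gpus, measured_seconds) > threshold


# Made-up numbers: 25600 s on one GPU projects to 100 s on 256 GPUs.
baseline = 25600.0
print(linear_projection(baseline, 256))
print(refutes_near_linear(baseline, 256, measured_seconds=115.0))
print(refutes_near_linear(baseline, 256, measured_seconds=130.0))
```

Under these assumptions a 115 s run stays within the 20 percent margin and a 130 s run does not.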

If this is right

Users can train larger models or use bigger batches without a proportional rise in wall-clock time.
Existing PyTorch code requires only a small wrapper to obtain these scaling benefits.
Clusters of commodity GPUs become practical for research that previously needed specialized hardware.
Communication overhead becomes a smaller fraction of total time as model size grows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same overlap and bucketing ideas could be ported to other frameworks that lack native support.
At still larger scales the assumption of low network latency would need re-examination.
Adding automatic detection of when to skip synchronization could further reduce overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

PyTorch DDP paper delivers concrete implementation details and near-linear scaling results to 256 GPUs on standard models.


The main takeaway is that this paper documents how PyTorch's distributed data parallel module uses gradient bucketing, overlap of all-reduce with the backward pass, and selective synchronization to reach near-linear scaling on 256 GPUs for common models like ResNet and BERT-scale networks. The work is an engineering description rather than a new algorithm, but the specific integration choices and the reported measurements are what make it worth reading. They explain the dependencies between computation and communication clearly and show how the optimizations address them in practice. The evaluations provide actual numbers from runs at that scale, which is the kind of evidence that helps people who need to train larger models today. The paper does well at making these techniques usable inside the framework and at showing the performance impact with real hardware data.

The soft spots are around the conditions required for the overlap to succeed. The scaling depends on having enough compute per bucket to hide the communication latency, which works for the high-FLOP models they tested but would likely degrade for thinner layers or smaller per-GPU batches. The paper does not supply a broad sensitivity analysis on the compute-to-communication ratio or bounds on when efficiency drops, so the claims are solid for the reported configurations but narrower in general applicability than the abstract might suggest. Network details also matter for exact reproduction.

This paper is mainly for practitioners tuning distributed training in PyTorch or similar systems. Readers who want implementation-level guidance and scaling benchmarks will find it useful. It deserves a serious referee because the empirical results are concrete and the system changes are described in enough detail to be actionable.

Referee Report

1 major / 2 minor

Summary. The paper presents the design, implementation, and evaluation of PyTorch's distributed data parallel (DDP) module as of v1.5. It covers optimizations including gradient bucketing, overlapping computation with communication via async all-reduce, and skipping gradient synchronization. The central empirical claim is that these techniques, when configured appropriately, enable near-linear scalability on up to 256 GPUs for standard vision and language models such as ResNet and BERT-scale networks.
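
One reading of "skipping gradient synchronization" is gradient accumulation: each worker sums gradients over several micro-batches locally and communicates once, which PyTorch exposes through DDP's `no_sync` context manager. The sketch below is a plain-Python stand-in, not DDP code; `all_reduce_mean` and the gradient values are hypothetical. It illustrates that, because averaging is linear, one synchronization after accumulation gives the same result as synchronizing every step.

```python
from typing import List, Tuple

Vector = List[float]


def all_reduce_mean(per_worker: List[Vector]) -> Vector:
    """Element-wise average across workers (stand-in for an all-reduce)."""
    n = len(per_worker)
    return [sum(vals) / n for vals in zip(*per_worker)]


def sync_every_step(grads: List[List[Vector]]) -> Tuple[Vector, int]:
    """grads[step][worker] is one worker's gradient for one micro-batch."""
    total = [0.0] * len(grads[0][0])
    calls = 0
    for step in grads:
        averaged = all_reduce_mean(step)  # communicate on every micro-batch
        calls += 1
        total = [t + a for t, a in zip(total, averaged)]
    return total, calls


def sync_after_accumulation(grads: List[List[Vector]]) -> Tuple[Vector, int]:
    """Accumulate locally on each worker, then communicate once."""
    num_workers, dim = len(grads[0]), len(grads[0][0])
    local = [[0.0] * dim for _ in range(num_workers)]
    for step in grads:
        for w, g in enumerate(step):
            local[w] = [a + b for a, b in zip(local[w], g)]
    return all_reduce_mean(local), 1


# Three micro-batches, two workers, two-element gradients (made-up values).
grads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -1.0], [1.5, 3.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
print(sync_every_step(grads))
print(sync_after_accumulation(grads))
```

Both paths produce the same accumulated gradient, with one collective call instead of three.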

Significance. If the reported measurements hold, the work offers substantial practical value by documenting engineering choices and their performance impact in a widely adopted framework. The concrete scalability results on 256 GPUs constitute reproducible empirical evidence that can guide practitioners and inform similar system designs; the absence of parameter fitting or invented axioms keeps the contribution grounded in direct measurement.

major comments (1)

[Evaluation] Evaluation section: the near-linear scalability claim at 256 GPUs rests on the assumption that per-bucket computation time exceeds all-reduce communication time on the target fabric. No sensitivity analysis or results are supplied for thinner models, smaller per-GPU batches, or lower arithmetic-intensity workloads where this overlap breaks, undermining the generality of the central claim.
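
The dependence on the compute-to-communication ratio can be illustrated with a toy timing model. This is a sketch under strong assumptions, not a reproduction of the paper's measurements: buckets become ready one after another during the backward pass, a single communication stream reduces them in order, and all times are fixed, made-up numbers.

```python
from typing import List


def backward_time_with_overlap(compute: List[float], comm: List[float]) -> float:
    """Time until the last bucket's all-reduce finishes.

    compute[i] is the backward time that produces bucket i, and comm[i] is
    its all-reduce time. Communication for a bucket starts once the bucket
    is ready and the previous all-reduce has finished.
    """
    ready = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        ready += c
        comm_done = max(ready, comm_done) + m
    return comm_done


def backward_time_without_overlap(compute: List[float], comm: List[float]) -> float:
    """All communication happens only after the whole backward pass."""
    return sum(compute) + sum(comm)


def exposed_communication(compute: List[float], comm: List[float]) -> float:
    """Communication time that is not hidden behind computation."""
    return backward_time_with_overlap(compute, comm) - sum(compute)


heavy = ([4.0, 4.0, 4.0], [1.0, 1.0, 1.0])  # compute dominates each bucket
thin = ([1.0, 1.0, 1.0], [4.0, 4.0, 4.0])   # communication dominates

for name, (compute, comm) in (("heavy", heavy), ("thin", thin)):
    print(name,
          backward_time_with_overlap(compute, comm),
          backward_time_without_overlap(compute, comm),
          exposed_communication(compute, comm))
```

In the compute-heavy case only the last bucket's all-reduce is exposed, while in the thin case almost all communication is exposed, which is the regime the comment asks the authors to characterize.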

minor comments (2)

[Abstract] Abstract and Evaluation: benchmark details (exact model sizes, per-GPU batch sizes, and network hardware specifications) should be stated explicitly to allow independent verification of the 256-GPU numbers.
[Implementation] Implementation: the description of how gradient bucketing interacts with the autograd engine could include a small diagram or pseudocode for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and the positive overall assessment of the work. We address the major comment point by point below.


Referee: [Evaluation] Evaluation section: the near-linear scalability claim at 256 GPUs rests on the assumption that per-bucket computation time exceeds all-reduce communication time on the target fabric. No sensitivity analysis or results are supplied for thinner models, smaller per-GPU batches, or lower arithmetic-intensity workloads where this overlap breaks, undermining the generality of the central claim.

Authors: We agree that the reported near-linear scaling results rely on workloads where per-bucket computation time is sufficient to overlap with communication. The manuscript evaluates the techniques on standard models (ResNet-50 and BERT-scale) with batch sizes typical for those models on the target 256-GPU cluster, as these represent the practical use cases motivating the optimizations. The abstract and evaluation section qualify the claim with the phrase 'when configured appropriately.' We do not claim universality across all possible models and batch sizes. In a revised version we will add a short paragraph in the evaluation section explicitly noting the workload dependence of the overlap benefit and stating that for thinner models or very small per-GPU batches the same configuration may yield sub-linear scaling, thereby clarifying the scope of the central claim without requiring new experiments.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical systems paper with claims resting on direct measurements


The paper describes the design and implementation of PyTorch's distributed data parallel module, including techniques such as gradient bucketing, overlapping computation with communication, and skipping synchronization. Its central claim of near-linear scalability on 256 GPUs is presented as the outcome of empirical evaluations on standard models rather than any derivation, equation, or fitted parameter. No mathematical predictions, self-definitional constructs, uniqueness theorems, or self-citation chains appear in the load-bearing steps. The work is self-contained as an engineering report whose results are directly falsifiable via reproduction on the reported hardware and workloads.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard distributed deep-learning assumptions rather than new axioms or invented entities. No free parameters are introduced to fit results; the work is an engineering report of existing techniques applied inside PyTorch.

axioms (2)

domain assumption Data parallelism replicates the full model on each device and averages gradients after each iteration to keep replicas consistent.
Invoked in the abstract description of the basic technique.
domain assumption Gradient communication can be overlapped with subsequent layer computation when the network and model structure permit.
Central to the overlapping optimization described.
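
The first assumption can be checked on a tiny example. The sketch below uses a hypothetical one-parameter linear model with a mean-squared-error loss and made-up data; it assumes equal-sized shards, since with unequal shards a plain average of per-replica gradients no longer matches the full-batch gradient.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse_gradient(w: float, batch: List[Sample]) -> float:
    """Derivative with respect to w of mean((w * x - y) ** 2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

# Single-process gradient over the full batch.
full_gradient = mse_gradient(w, data)

# Two replicas, each holding the same w and an equal shard of the batch.
shards = [data[:2], data[2:]]
local_gradients = [mse_gradient(w, shard) for shard in shards]

# Averaging the local gradients plays the role of the all-reduce step.
averaged_gradient = sum(local_gradients) / len(local_gradients)

print(full_gradient, local_gradients, averaged_gradient)
```

Because every replica then applies the same averaged gradient to the same starting weights, the replicas stay identical after the optimizer step.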

pith-pipeline@v0.9.0 · 5513 in / 1362 out tokens · 73462 ms · 2026-05-17T19:09:51.262341+00:00 · methodology


Original abstract

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
cs.LG 2026-05 unverdicted novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
cs.DC 2026-05 unverdicted novelty 7.0

JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
cs.LG 2026-05 unverdicted novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG 2026-05 unverdicted novelty 7.0

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
cs.LG 2026-04 unverdicted novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
cs.CV 2023-03 conditional novelty 7.0

BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumo...
FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs
cs.LG 2026-06 unverdicted novelty 6.0

FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.
Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication
cs.DC 2026-06 unverdicted novelty 6.0

Splaxel achieves up to 7.6x speedup in distributed 3DGS training on scenes with up to 120M Gaussians by using pixel-level communication and visibility prediction while preserving reconstruction quality.
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
cs.AI 2026-05 unverdicted novelty 6.0

Standard LLM inference benchmarks introduce systematic bias via GIL-induced queuing in single-process asyncio setups; a multi-process framework and NTPOT metric isolate true serving engine performance at high query rates.
Exploiting Multicast for Accelerating Collective Communication
cs.DC 2026-05 unverdicted novelty 6.0

MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
cs.DC 2026-05 unverdicted novelty 6.0

JanusPipe is a new 3D-parallel training system for conservative MLIPs that uses SymFold and WaveK to achieve 1.51x and 1.45x average throughput gains over 1F1B and Hanayo on 32 GPUs.
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
cs.LG 2026-05 unverdicted novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
ShardTensor: Domain Parallelism for Scientific Machine Learning
cs.DC 2026-05 unverdicted novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
Accelerating Compound LLM Training Workloads with Maestro
cs.DC 2026-05 unverdicted novelty 6.0

Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
cs.DC 2026-05 unverdicted novelty 6.0

MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism
cs.AI 2026-05 unverdicted novelty 6.0

PAT adaptively reconfigures tensor parallelism in RLHF generation using predictor-guided decisions and lightweight state updates, cutting generation latency by up to 34.6%.
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG 2026-05 unverdicted novelty 6.0

ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
cs.LG 2026-04 unverdicted novelty 6.0

CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
Training Time Prediction for Mixed Precision-based Distributed Training
cs.LG 2026-04 unverdicted novelty 6.0

A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.
Continuous Adversarial Flow Models
cs.LG 2026-04 unverdicted novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
veScale-FSDP: Flexible and High-Performance FSDP at Scale
cs.DC 2026-02 unverdicted novelty 6.0

veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at...
Gradient-descent methods for scalable quantum detector tomography
quant-ph 2025-11 conditional novelty 6.0

Gradient descent optimization reconstructs POVMs for phase-insensitive quantum detectors with higher or comparable fidelity to constrained convex optimization but in much less time.
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
cs.DC 2025-04 unverdicted novelty 6.0

MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource...
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
cs.LG 2024-10 unverdicted novelty 6.0

Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training
cs.LG 2024-10 unverdicted novelty 6.0

Analog-SGD-AP converges with iteration complexity O(ε^{-2} + ε^{-1}) for multi-layer DNNs on AIMC hardware despite analog weight-update imperfections and asynchronous stale gradients.
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
cs.DC 2023-04 unverdicted novelty 6.0

PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads
cs.DC 2026-06 unverdicted novelty 5.0

Eidola is a gem5 extension that emulates cycle-level peer-to-peer GPU writes via real-application timing profiles to simulate traffic and synchronization in multi-GPU AI systems.
Piper: A Programmable Distributed Training System
cs.DC 2026-06 unverdicted novelty 5.0

Piper decouples user-defined distributed training strategies from runtime execution using transformations on a unified global training DAG IR, achieving parity on ZeRO and gains on composed strategies like DualPipe.
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
cs.DC 2026-05 unverdicted novelty 5.0

DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing
cs.DC 2026-05 unverdicted novelty 5.0

Apollo uses temporal-spatial multiplexing and a performance model to let multiple multimodal model modules share GPUs, delivering up to 1.31x training speedup in testbed experiments.
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
cs.DC 2026-05 unverdicted novelty 5.0

An adaptive DNN partitioning framework for heterogeneous edge-cloud systems reduces energy consumption by 27-36% and end-to-end latency by 6-23% versus static baselines on real hardware with VGG16, AlexNet, and MobileNetV2.
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
math.OC 2026-05 unverdicted novelty 5.0

Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
ProTrain: Efficient LLM Training via Memory-Aware Techniques
cs.DC 2024-06 unverdicted novelty 5.0

ProTrain automates memory management for LLM training via cost models from profiling to deliver 1.43x-2.71x throughput gains over state-of-the-art systems without accuracy loss.
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
cs.DC 2026-05 unverdicted novelty 4.0

CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over ...
Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
cs.DC 2025-11 unverdicted novelty 4.0

Thermal imbalance in multi-GPU nodes creates hotter straggler GPUs that slow down cooler leader GPUs during overlapped computation and communication in LLM training.
Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
cs.PF 2026-05 unverdicted novelty 3.0

Discrete-event simulation finds optimal 10-100 km separation between AI clusters where hollow-core fiber provides 25% higher compute-communication overlap in geo-distributed data-parallel training.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 35 Pith papers · 6 internal anchors

[1]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

INTRODUCTION Deep Neural Networks (DNN) have powered a wide spectrum of applications, ranging from image recognition [20], language translation [15], anomaly detection [16], content recommendation [38], to drug discovery [33], art generation [28], game play [18], and self-driving cars [13]. Many applications pursue higher intelligence by optimizing la...

[2]

Then, we explain and justify the idea of data parallelism and describe communication primitives

BACKGROUND Before diving into distributed training, let us briefly discuss the implementation and execution of local model training using PyTorch. Then, we explain and justify the idea of data parallelism and describe communication primitives. 2.1 PyTorch PyTorch organizes values into Tensors which are generic n-dimensional arrays with a rich set of da...

[3]

During distributed training, each process has its own local model replica and local optimizer

SYSTEM DESIGN PyTorch [30] provides a DistributedDataParallel (DDP) module to help easily parallelize training across multiple processes and machines. During distributed training, each process has its own local model replica and local optimizer. In terms of correctness, distributed data parallel training and local training must be mathematically equiv...

[4]

This section focuses on the current status as of PyTorch v1.5.0

IMPLEMENTATION The implementation of DDP has evolved several times in the past few releases. This section focuses on the current status as of PyTorch v1.5.0. DDP implementation lives both in Python and C++ files, with Python exposing the API and composing non-performance-critical components, and C++ serving the core gradient reduction algorithm. The Python ...

[5]

In the exclusive cluster, the GPUs are located on 4 servers, connected using Mellanox MT27700 ConnectX-4 100GB/s NIC

EVALUATION This section presents the evaluation results of PyTorch DDP using an exclusive 32 GPU cluster and a shared entitlement. In the exclusive cluster, the GPUs are located on 4 servers, connected using Mellanox MT27700 ConnectX-4 100GB/s NIC. All 4 servers reside in the same rack, and each server is equipped with 8 NVIDIA Tesla V100 GPUs. Fig. 5...

[6]

We then present several ideas for future improvements

DISCUSSION This section discusses lessons learned from our experiments and past experiences. We then present several ideas for future improvements. 6.1 Lessons Learned Distributed data parallel training is a conceptually simple but practically subtle framework. There are various techniques to improve its speed, creating a complex configuration space...

[7]

Below are three popular categorizations

RELATED WORK Distributed training algorithms can be categorized into different types from different perspectives. Below are three popular categorizations. • Synchronous update vs Asynchronous update: With the former, all model replicas can use AllReduce to collectively communicate gradients or parameters, while the asynchronous scheme employs P2P communic...

[8]

DDP accelerates training by aggregating gradients into buckets for communication, overlapping communication with computation, and skipping synchronizations

CONCLUSION This paper explained the design and implementation of the distributed data parallel module in PyTorch v1.5, and conducted performance evaluations on NCCL and Gloo backend using ResNet50 and BERT models. DDP accelerates training by aggregating gradients into buckets for communication, overlapping communication with computation, and skipping ...

[9]

https://github.com/facebookincubator/gloo, 2019

Gloo: a collective communications library. https://github.com/facebookincubator/gloo, 2019

[10]

https://developer.nvidia.com/nccl, 2019

NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl, 2019

[11]

https://www.nvidia.com/en-us/data-center/nvlink/, 2019

NVLINK AND NVSWITCH: The Building Blocks of Advanced Multi-GPU Communication. https://www.nvidia.com/en-us/data-center/nvlink/, 2019

[12]

https://www.open-mpi.org/, 2019

Open MPI: A High Performance Message Passing Library. https://www.open-mpi.org/, 2019

[13]

https://pybind11.readthedocs.io/, 2019

Pybind11: Seamless operability between C++11 and Python. https://pybind11.readthedocs.io/, 2019

[14]

https://pytorch.org/docs/master/rpc.html, 2019

PyTorch Distributed RPC Framework. https://pytorch.org/docs/master/rpc.html, 2019

[15]

https://pytorch.org/docs/stable/nn.html#torch.nn.Module.forward, 2019

PyTorch Module forward Function. https://pytorch.org/docs/stable/nn.html#torch.nn.Module.forward, 2019

[16]

https://docs.scipy.org/, 2019

SciPy: open-source software for mathematics, science, and engineering. https://docs.scipy.org/, 2019

[17]

https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel, 2020

PyTorch DistributedDataParallel. https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel, 2020

[18]

https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy, 2020

TensorFlow Distributed Training MultiWorkerMirroredStrategy. https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy, 2020

[19]

https://www.tensorflow.org/guide/distributed_training#parameterserverstrategy, 2020

TensorFlow Distributed Training ParameterServerStrategy. https://www.tensorflow.org/guide/distributed_training#parameterserverstrategy, 2020

[20]

Y. Bao, Y. Peng, Y. Chen, and C. Wu. Preemptive all-reduce scheduling for expediting distributed dnn training. In IEEE INFOCOM, 2020

[21]

End to End Learning for Self-Driving Cars

M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016

[22]

M. Cho, U. Finkler, M. Serrano, D. Kung, and H. Hunter. Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy. IBM Journal of Research and Development, 63(6):1–1, 2019

[23]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

[24]

M. Du, F. Li, G. Zheng, and V. Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298, 2017

[25]

A. Fan, E. Grave, and A. Joulin. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019

[26]

X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pages 3338–3346, 2014

[27]

S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. Tictac: Accelerating distributed deep learning with communication scheduling. arXiv preprint arXiv:1803.03288, 2018

[28]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[29]

Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pages 103–112, 2019

[30]

S. Jeaugey. Massively Scale Your Deep Learning Training with NCCL 2.4. https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/, February 2019

[31]

S. Kim, G.-I. Yu, H. Park, S. Cho, E. Jeong, H. Ha, S. Lee, J. S. Jeong, and B.-G. Chun. Parallax: Sparsity-aware data parallel training of deep neural networks. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–15, 2019

[32]

J. Kosaian, K. V. Rashmi, and S. Venkataraman. Parity models: Erasure-coded resilience for prediction serving systems. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 30–46, New York, NY, USA, 2019. Association for Computing Machinery

[33]

Y. LeCun, C. Cortes, and C. Burges. The MNIST Database. http://yann.lecun.com/exdb/mnist/, 1999

[34]

Y. LeCun, D. Touretzky, G. Hinton, and T. Sejnowski. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, volume 1, pages 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988

[35]

M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014

[36]

H. Mao, M. Cheung, and J. She. DeepArt: Learning joint representations of visual arts. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1183–1191, 2017

[37]

D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

[38]

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Sys...

[39]

Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, C. Wu, and C. Guo. A generic communication scheduler for distributed DNN training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 16–29, 2019

[40]

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054, 2019

[41]

B. Ramsundar, P. Eastman, P. Walters, and V. Pande. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O'Reilly Media, Inc., 2019

[42]

F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014

[43]

A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018

[44]

N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018

[45]

P. Sun, Y. Wen, R. Han, W. Feng, and S. Yan. GradientFlow: Optimizing network performance for large-scale distributed DNN training. IEEE Transactions on Big Data, 2019

[46]

A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Advances in Neural Information Processing Systems, pages 2643–2651, 2013

[47]

G. Wang, S. Venkataraman, A. Phanishayee, J. Thelin, N. Devanur, and I. Stoica. Blink: Fast and generic collectives for distributed ML. arXiv preprint arXiv:1910.04940, 2019

[48]

J. Wang, V. Tantia, N. Ballas, and M. Rabbat. SlowMo: Improving communication-efficient distributed SGD with slow momentum. arXiv preprint arXiv:1910.00643, 2019

[1]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

INTRODUCTION: Deep Neural Networks (DNN) have powered a wide spectrum of applications, ranging from image recognition [20], language translation [15], anomaly detection [16], content recommendation [38], to drug discovery [33], art generation [28], game play [18], and self-driving cars [13]. Many applications pursue higher intelligence by optimizing la...

[2]

BACKGROUND: Before diving into distributed training, let us briefly discuss the implementation and execution of local model training using PyTorch. Then, we explain and justify the idea of data parallelism and describe communication primitives. 2.1 PyTorch: PyTorch organizes values into Tensors which are generic n-dimensional arrays with a rich set of da...

[3]

SYSTEM DESIGN: PyTorch [30] provides a DistributedDataParallel (DDP) module to help easily parallelize training across multiple processes and machines. During distributed training, each process has its own local model replica and local optimizer. In terms of correctness, distributed data parallel training and local training must be mathematically equiv...

[4]

IMPLEMENTATION: The implementation of DDP has evolved several times in the past few releases. This section focuses on the current status as of PyTorch v1.5.0. The DDP implementation lives in both Python and C++ files, with Python exposing the API and composing non-performance-critical components, and C++ serving the core gradient reduction algorithm. The Python ...

[5]

EVALUATION: This section presents the evaluation results of PyTorch DDP using an exclusive 32 GPU cluster and a shared entitlement. In the exclusive cluster, the GPUs are located on 4 servers, connected using Mellanox MT27700 ConnectX-4 100GB/s NIC. All 4 servers reside in the same rack, and each server is equipped with 8 NVIDIA Tesla V100 GPUs. Fig. 5...

[6]

DISCUSSION: This section discusses lessons learned from our experiments and past experiences. We then present several ideas for future improvements. 6.1 Lessons Learned: Distributed data parallel training is a conceptually simple yet practically subtle framework. There are various techniques to improve its speed, creating a complex configuration space...

[7]

RELATED WORK: Distributed training algorithms can be categorized into different types from different perspectives. Below are three popular categorizations. • Synchronous update vs Asynchronous update: With the former, all model replicas can use AllReduce to collectively communicate gradients or parameters, while the asynchronous scheme employs P2P communic...

[8]

CONCLUSION: This paper explained the design and implementation of the distributed data parallel module in PyTorch v1.5, and conducted performance evaluations on NCCL and Gloo backend using ResNet50 and BERT models. DDP accelerates training by aggregating gradients into buckets for communication, overlapping communication with computation, and skipping ...

[9]

Gloo: a collective communications library. https://github.com/facebookincubator/gloo, 2019

[10]

NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl, 2019

[11]

NVLINK AND NVSWITCH: The Building Blocks of Advanced Multi-GPU Communication. https://www.nvidia.com/en-us/data-center/nvlink/, 2019

[12]

Open MPI: A High Performance Message Passing Library. https://www.open-mpi.org/, 2019

[13]

Pybind11: Seamless operability between C++11 and Python. https://pybind11.readthedocs.io/, 2019

[14]

PyTorch Distributed RPC Framework. https://pytorch.org/docs/master/rpc.html, 2019

[15]

PyTorch Module forward Function. https://pytorch.org/docs/stable/nn.html#torch.nn.Module.forward, 2019

[16]

SciPy: open-source software for mathematics, science, and engineering. https://docs.scipy.org/, 2019

[17]

PyTorch DistributedDataParallel. https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel, 2020

[18]

TensorFlow Distributed Training MultiWorkerMirroredStrategy. https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy, 2020
