PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Pith reviewed 2026-05-17 19:09 UTC · model grok-4.3
The pith
PyTorch's distributed data parallel module achieves near-linear scaling to 256 GPUs by overlapping computation with communication.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replicating the model on each worker, computing gradients locally, and then using bucketing plus asynchronous all-reduce to keep replicas identical, the PyTorch distributed data parallel module attains near-linear scalability when configured appropriately on up to 256 GPUs.
What carries the argument
Gradient bucketing combined with overlapping of backward computation and all-reduce communication.
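The bucketing half of this mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PyTorch's implementation: the function name `assign_buckets`, the layer names, the byte sizes, and the cap are all hypothetical. It assumes gradients become ready roughly back to front during the backward pass, so it fills buckets in reverse registration order and closes a bucket once it reaches the cap.

```python
from typing import List, Sequence, Tuple


def assign_buckets(
    param_sizes: Sequence[Tuple[str, int]], cap_bytes: int
) -> List[List[str]]:
    """Group parameters into buckets of at least cap_bytes each.

    param_sizes lists (name, size_in_bytes) in forward registration order.
    Buckets are filled in reverse order, on the assumption that gradients
    become ready roughly back to front during the backward pass.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in reversed(param_sizes):
        current.append(name)
        current_bytes += size
        if current_bytes >= cap_bytes:
            # Bucket is full: it would be handed to one all-reduce call.
            buckets.append(current)
            current, current_bytes = [], 0
    if current:
        # Leftover parameters form a final, smaller bucket.
        buckets.append(current)
    return buckets


# Hypothetical layer sizes in bytes, for illustration only.
params = [("conv1", 40), ("conv2", 60), ("fc1", 100), ("fc2", 30)]
buckets = assign_buckets(params, cap_bytes=100)
print(buckets)
```

Each bucket stands for one all-reduce call, so a larger cap means fewer, larger collectives, while a smaller cap lets communication start earlier in the backward pass.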
If this is right
- Users can train larger models or use bigger batches without a proportional rise in wall-clock time.
- Existing PyTorch code requires only a small wrapper to obtain these scaling benefits.
- Clusters of commodity GPUs become practical for research that previously needed specialized hardware.
- Communication overhead becomes a smaller fraction of total time as model size grows.
Where Pith is reading between the lines
- The same overlap and bucketing ideas could be ported to other frameworks that lack native support.
- At still larger scales the assumption of low network latency would need re-examination.
- Adding automatic detection of when to skip synchronization could further reduce overhead.
Load-bearing premise
Typical deep learning models contain enough computation per layer for communication to be hidden behind it, and the network supports low-latency collective operations at the tested scale.
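A toy timing model shows why this premise matters. The sketch below is an assumption-laden simplification, not a measurement: the function names and all timings are hypothetical, and it assumes buckets are all-reduced one at a time, each starting once its gradients are ready and the previous all-reduce has finished.

```python
from typing import Sequence


def backward_time_with_overlap(
    compute: Sequence[float], comm: Sequence[float]
) -> float:
    """Finish time of a backward pass when all-reduce overlaps compute.

    compute[i] is the time to produce bucket i's gradients and comm[i] is
    the all-reduce time for bucket i (both hypothetical inputs).
    """
    compute_done = 0.0
    comm_done = 0.0
    for c, r in zip(compute, comm):
        compute_done += c
        # All-reduce waits for both its gradients and the previous all-reduce.
        comm_done = max(comm_done, compute_done) + r
    return max(compute_done, comm_done)


def backward_time_without_overlap(
    compute: Sequence[float], comm: Sequence[float]
) -> float:
    """Baseline: communicate only after all computation has finished."""
    return sum(compute) + sum(comm)


# Compute-bound case: every bucket computes longer than it communicates.
hidden = backward_time_with_overlap([4.0, 4.0, 4.0], [3.0, 3.0, 3.0])
# Communication-bound case: a thin model on the same network.
exposed = backward_time_with_overlap([1.0, 1.0, 1.0], [3.0, 3.0, 3.0])
print(hidden, exposed)
```

In the compute-bound case only the last bucket's all-reduce remains exposed. In the communication-bound case most of the communication time is still visible, which is the regime where this premise fails.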
What would settle it
A benchmark run on 256 GPUs showing wall-clock training time that is more than 20 percent worse than the linear projection from a single-GPU baseline.
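The criterion can be written as a short check. This sketch assumes a fixed total workload, so that the linear projection for N GPUs is the single-GPU time divided by N. The function names and the timings are hypothetical and are not results from the paper.

```python
def linear_projection(single_gpu_seconds: float, num_gpus: int) -> float:
    """Ideal wall-clock time if a fixed workload scaled perfectly linearly."""
    return single_gpu_seconds / num_gpus


def refutes_near_linear(
    single_gpu_seconds: float,
    num_gpus: int,
    measured_seconds: float,
    tolerance: float = 0.20,
) -> bool:
    """True if the measured time is more than `tolerance` worse than linear."""
    projected = linear_projection(single_gpu_seconds, num_gpus)
    return measured_seconds > (1.0 + tolerance) * projected


# Hypothetical numbers: a 25,600 s single-GPU run projects to 100 s on 256 GPUs.
within = refutes_near_linear(25600.0, 256, measured_seconds=115.0)
beyond = refutes_near_linear(25600.0, 256, measured_seconds=130.0)
print(within, beyond)
```

Under these assumed numbers, a 115 s run stays inside the 20 percent band and a 130 s run falls outside it.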
Original abstract
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the design, implementation, and evaluation of PyTorch's distributed data parallel (DDP) module as of v1.5. It covers optimizations including gradient bucketing, overlapping computation with communication via async all-reduce, and skipping gradient synchronization. The central empirical claim is that these techniques, when configured appropriately, enable near-linear scalability on up to 256 GPUs for standard vision and language models such as ResNet50 and BERT.
Significance. If the reported measurements hold, the work offers substantial practical value by documenting engineering choices and their performance impact in a widely adopted framework. The concrete scalability results on 256 GPUs constitute reproducible empirical evidence that can guide practitioners and inform similar systems designs; the absence of parameter fitting or invented axioms keeps the contribution grounded in direct measurement.
major comments (1)
- [Evaluation] Evaluation section: the near-linear scalability claim at 256 GPUs rests on the assumption that per-bucket computation time exceeds all-reduce communication time on the target fabric. No sensitivity analysis or results are supplied for thinner models, smaller per-GPU batches, or lower arithmetic-intensity workloads where this overlap breaks, undermining the generality of the central claim.
minor comments (2)
- [Abstract] Abstract and Evaluation: benchmark details (exact model sizes, per-GPU batch sizes, and network hardware specifications) should be stated explicitly to allow independent verification of the 256-GPU numbers.
- [Implementation] Implementation: the description of how gradient bucketing interacts with the autograd engine could include a small diagram or pseudocode for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the positive overall assessment of the work. We address the major comment point by point below.
-
Referee: [Evaluation] Evaluation section: the near-linear scalability claim at 256 GPUs rests on the assumption that per-bucket computation time exceeds all-reduce communication time on the target fabric. No sensitivity analysis or results are supplied for thinner models, smaller per-GPU batches, or lower arithmetic-intensity workloads where this overlap breaks, undermining the generality of the central claim.
Authors: We agree that the reported near-linear scaling results rely on workloads where per-bucket computation time is sufficient to overlap with communication. The manuscript evaluates the techniques on standard models (ResNet-50 and BERT-scale) with batch sizes typical for those models on the target 256-GPU cluster, as these represent the practical use cases motivating the optimizations. The abstract and evaluation section qualify the claim with the phrase 'when configured appropriately.' We do not claim universality across all possible models and batch sizes. In a revised version we will add a short paragraph in the evaluation section explicitly noting the workload dependence of the overlap benefit and stating that for thinner models or very small per-GPU batches the same configuration may yield sub-linear scaling, thereby clarifying the scope of the central claim without requiring new experiments. revision: partial
Circularity Check
No circularity: empirical systems paper with claims resting on direct measurements
The paper describes the design and implementation of PyTorch's distributed data parallel module, including techniques such as gradient bucketing, overlapping computation with communication, and skipping synchronization. Its central claim of near-linear scalability on 256 GPUs is presented as the outcome of empirical evaluations on standard models rather than any derivation, equation, or fitted parameter. No mathematical predictions, self-definitional constructs, uniqueness theorems, or self-citation chains appear in the load-bearing steps. The work is self-contained as an engineering report whose results are directly falsifiable via reproduction on the reported hardware and workloads.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Data parallelism replicates the full model on each device and averages gradients after each iteration to keep replicas consistent.
- domain assumption Gradient communication can be overlapped with subsequent layer computation when the network and model structure permit.
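The first assumption can be checked numerically on a toy problem. The sketch below is illustrative only: the one-parameter model, the data, and the helper name `grad` are hypothetical. It relies on a mean-reduced loss and equal-sized shards, in which case the average of the per-replica gradients equals the gradient over the full batch.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]


def grad(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of the mean of 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical data and weight, for illustration only.
data: List[Sample] = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 0.5
world_size = 2

# Each replica sees an equal-sized, disjoint shard of the global batch.
shard_len = len(data) // world_size
shards = [data[i * shard_len:(i + 1) * shard_len] for i in range(world_size)]

# Local gradients, then the averaging step an all-reduce would perform.
local_grads = [grad(w, shard) for shard in shards]
averaged = sum(local_grads) / world_size

full_batch = grad(w, data)
print(averaged, full_batch)
```

If the shards had unequal sizes, a plain average would weight samples unevenly and the two values would generally differ.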
Forward citations
Cited by 18 Pith papers
-
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
-
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumo...
-
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
Accelerating Compound LLM Training Workloads with Maestro
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
-
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
-
Training Time Prediction for Mixed Precision-based Distributed Training
A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
veScale-FSDP: Flexible and High-Performance FSDP at Scale
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at...
-
Gradient-descent methods for scalable quantum detector tomography
Gradient descent optimization reconstructs POVMs for phase-insensitive quantum detectors with higher or comparable fidelity to constrained convex optimization but in much less time.
-
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
An adaptive DNN partitioning framework for heterogeneous edge-cloud systems reduces energy consumption by 27-36% and end-to-end latency by 6-23% versus static baselines on real hardware with VGG16, AlexNet, and MobileNetV2.
-
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
-
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over ...
-
Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
Thermal imbalance in multi-GPU nodes creates hotter straggler GPUs that slow down cooler leader GPUs during overlapped computation and communication in LLM training.
Reference graph
Works this paper leans on
-
[1]
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
INTRODUCTION Deep Neural Networks (DNN) have powered a wide spectrum of applications, ranging from image recognition [20], language translation [15], anomaly detection [16], content recommendation [38], to drug discovery [33], art generation [28], game play [18], and self-driving cars [13]. Many applications pursue higher intelligence by optimizing la...
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[2]
Then, we explain and justify the idea of data parallelism and describe communication primitives
BACKGROUND Before diving into distributed training, let us briefly discuss the implementation and execution of local model training using PyTorch. Then, we explain and justify the idea of data parallelism and describe communication primitives. 2.1 PyTorch PyTorch organizes values into Tensors which are generic n-dimensional arrays with a rich set of da...
-
[3]
During distributed training, each process has its own local model replica and local optimizer
SYSTEM DESIGN PyTorch [30] provides a DistributedDataParallel (DDP) module to help easily parallelize training across multiple processes and machines. During distributed training, each process has its own local model replica and local optimizer. In terms of correctness, distributed data parallel training and local training must be mathematically equiv...
-
[4]
This section focuses on the current status as of PyTorch v1.5.0
IMPLEMENTATION The implementation of DDP has evolved several times in the past few releases. This section focuses on the current status as of PyTorch v1.5.0. DDP implementation lives both in Python and C++ files, with Python exposing the API and composing non-performance-critical components, and C++ serving the core gradient reduction algorithm. The Python ...
-
[5]
EVALUATION This section presents the evaluation results of PyTorch DDP using an exclusive 32 GPU cluster and a shared entitlement. In the exclusive cluster, the GPUs are located on 4 servers, connected using Mellanox MT27700 ConnectX-4 100GB/s NIC. All 4 servers reside in the same rack, and each server is equipped with 8 NVIDIA Tesla V100 GPUs. Fig. 5...
-
[6]
We then present several ideas for future improvements
DISCUSSION This section discusses lessons learned from our experiments and past experiences. We then present several ideas for future improvements. 6.1 Lessons Learned Distributed data parallel training is a conceptually simple but practically subtle framework. There are various techniques to improve its speed, creating a complex configuration space...
-
[7]
Below are three popular categorizations
RELATED WORK Distributed training algorithms can be categorized into different types from different perspectives. Below are three popular categorizations. • Synchronous update vs Asynchronous update: With the former, all model replicas can use AllReduce to collectively communicate gradients or parameters, while the asynchronous scheme employs P2P communic...
-
[8]
CONCLUSION This paper explained the design and implementation of the distributed data parallel module in PyTorch v1.5, and conducted performance evaluations on NCCL and Gloo backend using ResNet50 and BERT models. DDP accelerates training by aggregating gradients into buckets for communication, overlapping communication with computation, and skipping ...
-
[9]
https://github.com/facebookincubator/gloo, 2019
Gloo: a collective communications library. https://github.com/facebookincubator/gloo, 2019
work page 2019
-
[10]
https://developer.nvidia.com/nccl, 2019
NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl, 2019
work page 2019
-
[11]
https://www.nvidia.com/en-us/data-center/nvlink/, 2019
NVLINK AND NVSWITCH: The Building Blocks of Advanced Multi-GPU Communication. https://www.nvidia.com/en-us/data-center/nvlink/, 2019
work page 2019
-
[12]
https://www.open-mpi.org/, 2019
Open MPI: A High Performance Message Passing Library. https://www.open-mpi.org/, 2019
work page 2019
-
[13]
https://pybind11.readthedocs.io/, 2019
Pybind11: Seamless operability between C++11 and Python. https://pybind11.readthedocs.io/, 2019
work page 2019
-
[14]
https://pytorch.org/docs/master/rpc.html, 2019
PyTorch Distributed RPC Framework. https://pytorch.org/docs/master/rpc.html, 2019
work page 2019
-
[15]
https://pytorch.org/docs/stable/nn.html#torch
PyTorch Module forward Function. https://pytorch.org/docs/stable/nn.html#torch.nn.Module.forward, 2019
work page 2019
-
[16]
SciPy: open-source software for mathematics, science, and engineering. https://docs.scipy.org/, 2019
work page 2019
-
[17]
https://pytorch.org/docs/stable/nn.html#torch
PyTorch DistributedDataParallel. https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel, 2020
work page 2020
-
[18]
https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy, 2020
TensorFlow Distributed Training MultiWorkerMirroredStrategy. https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy, 2020
work page 2020
-
[19]
https://www.tensorflow.org/guide/distributed_training#parameterserverstrategy, 2020
TensorFlow Distributed Training ParameterServerStrategy. https://www.tensorflow.org/guide/distributed_training#parameterserverstrategy, 2020
work page 2020
-
[20]
Y. Bao, Y. Peng, Y. Chen, and C. Wu. Preemptive all-reduce scheduling for expediting distributed dnn training. In IEEE INFOCOM, 2020
work page 2020
-
[21]
End to End Learning for Self-Driving Cars
M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[22]
M. Cho, U. Finkler, M. Serrano, D. Kung, and H. Hunter. Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy. IBM Journal of Research and Development, 63(6):1–1, 2019
work page 2019
-
[23]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[24]
M. Du, F. Li, G. Zheng, and V. Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298, 2017
work page 2017
- [25]
-
[26]
X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pages 3338–3346, 2014
work page 2014
-
[27]
S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. Tictac: Accelerating distributed deep learning with communication scheduling. arXiv preprint arXiv:1803.03288, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[28]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
- [29]
-
[30]
S. Jeaugey. Massively Scale Your Deep Learning Training with NCCL 2.4. https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/, February 2019
work page 2019
- [31]
-
[32]
J. Kosaian, K. V. Rashmi, and S. Venkataraman. Parity models: Erasure-coded resilience for prediction serving systems. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 30–46, New York, NY, USA, 2019. Association for Computing Machinery
work page 2019
- [33]
- [34]
-
[35]
M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014
work page 2014
-
[36]
H. Mao, M. Cheung, and J. She. Deepart: Learning joint representations of visual arts. In Proceedings of the 25th ACM international conference on Multimedia, pages 1183–1191, 2017
work page 2017
-
[37]
D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019
work page 2019
-
[38]
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Sys...
work page 2019
-
[39]
Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, C. Wu, and C. Guo. A generic communication scheduler for distributed dnn training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 16–29, 2019
work page 2019
-
[40]
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[41]
B. Ramsundar, P. Eastman, P. Walters, and V. Pande. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O'Reilly Media, Inc., 2019
work page 2019
- [42]
-
[43]
Horovod: fast and easy distributed deep learning in TensorFlow
A. Sergeev and M. D. Balso. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[44]
N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018
work page 2018
-
[45]
P. Sun, Y. Wen, R. Han, W. Feng, and S. Yan. Gradientflow: Optimizing network performance for large-scale distributed dnn training. IEEE Transactions on Big Data, 2019
work page 2019
-
[46]
A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Advances in neural information processing systems, pages 2643–2651, 2013
work page 2013
- [47]
- [48]