Horovod: fast and easy distributed deep learning in TensorFlow
read the original abstract
Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod
This paper has not been read by Pith yet.
Forward citations
Cited by 34 Pith papers
-
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
-
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
-
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
-
On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning
A single global merge at the final step of decentralized SGD matches the convergence rate of parallel SGD while improving test accuracy under high data heterogeneity.
-
Ring Attention with Blockwise Transformers for Near-Infinite Context
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
-
Exploiting Multicast for Accelerating Collective Communication
MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
-
A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal ...
-
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
JanusPipe is a new 3D-parallel training system for conservative MLIPs that uses SymFold and WaveK to achieve 1.51x and 1.45x average throughput gains over 1F1B and Hanayo on 32 GPUs.
-
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, ...
-
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery
AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.
-
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
-
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
-
A Physics-Informed Neural Network for Solving the Quasi-static Magnetohydrodynamic Equations
A PINN solves the time-dependent quasi-static MHD equations in axisymmetric tokamak geometry without training data and reproduces vertical plasma displacement seen in ground-truth simulations.
-
Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...
-
Diffusion-Based Point-Cloud Generation of Heavy-Ion Events
A two-stage score-driven diffusion model with Point-Edge Transformer generates realistic high-multiplicity heavy-ion events as point clouds.
-
InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training
InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.
-
Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.
-
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource...
-
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
-
HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
-
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
-
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
-
Modulated learning for private and distributed regression with just a single sample per client device
Single-sample clients add one calibrated noisy perturbation to their data point and share transformed representations, allowing the server to recover unbiased gradients for private distributed regression.
-
DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting
DistributedEstimator demonstrates that circuit cutting preserves test accuracy and robustness in QNN training on Iris and MNIST while revealing that classical reconstruction dominates runtime and exponential subcircui...
-
On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers
The paper analyzes the PerfBound power-saving mechanism for Ethernet-based interconnects, identifies weaknesses in dynamic power-down methods, and proposes an enhancement that improves energy reduction with minimal pe...
-
Entanglement and Bell Nonlocality in $\tau^+ \tau^-$ at the LHC using Machine Learning for Neutrino Reconstruction
Simulations of pp to tau+ tau- at the LHC with ML neutrino reconstruction show Bell nonlocality above 5 sigma, proposing tau pairs as a new benchmark system for quantum information studies.
-
Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training
Cloudless-Training proposes a two-layer serverless framework with elastic scheduling and two new synchronization strategies (ASGD-GA and inter-PS model averaging) that reports 9.2-24% cost reduction and up to 1.7x spe...
-
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
PyTorch distributed data parallel attains near-linear scalability on 256 GPUs through gradient bucketing, computation-communication overlap, and selective synchronization skipping.
-
Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale
Optimizations to Petastorm and Parquet data pipelines with caching and deterministic queues reduce large-scale deep learning training time by 6x while raising GPU utilization above 60% and eliminating run-to-run variance.
-
Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance
The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.
-
Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.
-
Adaptation of AI-accelerated CFD Simulations to the IPU platform
Porting AI-accelerated CFD model training to IPU-POD16 yields 34% data-feeding speedup and scales throughput to 2805 samples/s on 16 IPUs despite inter-IPU communication limits.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.