pith. machine review for the scientific record.

arxiv: 1802.05799 · v3 · submitted 2018-02-15 · 💻 cs.LG · stat.ML

Recognition: unknown

Horovod: fast and easy distributed deep learning in TensorFlow

Authors on Pith: no claims yet
classification: 💻 cs.LG · stat.ML
keywords: training · communication · horovod · library · code · inter-gpu · tensorflow · computation
original abstract

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod
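The abstract's key systems idea is ring reduction: rather than funneling all gradients through a central parameter server, workers arranged in a logical ring exchange gradient chunks with their neighbors, so each worker's bandwidth use stays constant as worker count grows. As a minimal sketch of that idea (a pure-Python simulation of the ring all-reduce pattern, not Horovod's actual NCCL-backed implementation; the function name and chunk layout here are illustrative assumptions):

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: `grads` is one equal-length gradient
    vector per worker; returns each worker's copy of the summed vector.

    Assumes the vector length is divisible by the number of workers,
    so each worker owns exactly one chunk.
    """
    n = len(grads)
    assert len(grads[0]) % n == 0, "vector length must divide evenly"
    c = len(grads[0]) // n          # chunk size
    data = [list(g) for g in grads]  # per-worker working copies

    def chunk(k):
        return slice(k * c, (k + 1) * c)

    # Phase 1, scatter-reduce: in each of n-1 steps, worker i sends one
    # chunk to its ring neighbor (i+1), which adds it in. Afterwards,
    # worker i holds the fully summed chunk (i+1) % n.
    for step in range(n - 1):
        snap = [list(d) for d in data]  # values at the start of the step
        for i in range(n):
            send = (i - step) % n
            dst = (i + 1) % n
            s = chunk(send)
            data[dst][s] = [a + b for a, b in zip(data[dst][s], snap[i][s])]

    # Phase 2, all-gather: the fully reduced chunks travel once more
    # around the ring, overwriting stale chunks, so every worker ends
    # with the complete summed vector.
    for step in range(n - 1):
        snap = [list(d) for d in data]
        for i in range(n):
            send = (i + 1 - step) % n
            dst = (i + 1) % n
            s = chunk(send)
            data[dst][s] = snap[i][s]

    return data
```

Each worker sends and receives 2(n-1) chunks of size L/n regardless of n, which is why this pattern is bandwidth-optimal and scales to many GPUs; Horovod pairs it with a few-line API change (wrapping the optimizer) rather than a rewrite of the model-building code.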

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

    cs.DC 2026-05 unverdicted novelty 7.0

    NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

  2. Ring Attention with Blockwise Transformers for Near-Infinite Context

    cs.CL 2023-10 unverdicted novelty 7.0

    Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

  3. ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

    cs.DC 2026-05 unverdicted novelty 6.0

    ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...

  4. ShardTensor: Domain Parallelism for Scientific Machine Learning

    cs.DC 2026-05 unverdicted novelty 6.0

    ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

  5. AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.

  6. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  7. A Physics-Informed Neural Network for Solving the Quasi-static Magnetohydrodynamic Equations

    physics.plasm-ph 2026-04 unverdicted novelty 6.0

    A PINN solves the time-dependent quasi-static MHD equations in axisymmetric tokamak geometry without training data and reproduces vertical plasma displacement seen in ground-truth simulations.

  8. Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

    cs.NI 2026-04 unverdicted novelty 6.0

    Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...

  9. Diffusion-Based Point-Cloud Generation of Heavy-Ion Events

    hep-ph 2026-04 unverdicted novelty 6.0

    A two-stage score-driven diffusion model with Point-Edge Transformer generates realistic high-multiplicity heavy-ion events as point clouds.

  10. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  11. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  12. Modulated learning for private and distributed regression with just a single sample per client device

    cs.LG 2026-05 unverdicted novelty 5.0

    Single-sample clients add one calibrated noisy perturbation to their data point and share transformed representations, allowing the server to recover unbiased gradients for private distributed regression.

  13. Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

    cs.DC 2026-04 unverdicted novelty 4.0

    Optimizations to Petastorm and Parquet data pipelines with caching and deterministic queues reduce large-scale deep learning training time by 6x while raising GPU utilization above 60% and eliminating run-to-run variance.

  14. Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance

    econ.GN 2026-05 unverdicted novelty 3.0

    The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.

  15. Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

    cs.LG 2026-05 unverdicted novelty 3.0

    This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.

  16. Adaptation of AI-accelerated CFD Simulations to the IPU platform

    cs.DC 2026-05 unverdicted novelty 3.0

    Porting AI-accelerated CFD model training to IPU-POD16 yields 34% data-feeding speedup and scales throughput to 2805 samples/s on 16 IPUs despite inter-IPU communication limits.