pith. machine review for the scientific record.

arxiv: 1802.05799 · v3 · submitted 2018-02-15 · 💻 cs.LG · stat.ML

Recognition: unknown

Horovod: fast and easy distributed deep learning in TensorFlow

Authors on Pith: no claims yet
classification: 💻 cs.LG · stat.ML
keywords: training · communication · horovod · library · code · inter-gpu · tensorflow · computation
original abstract

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod
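The abstract's key systems idea is ring reduction: rather than funneling all gradients through a central parameter server, workers arranged in a logical ring exchange gradient chunks with their neighbors, so each worker's bandwidth use stays constant as worker count grows. As a minimal sketch of that idea (a pure-Python simulation of the ring all-reduce pattern, not Horovod's actual NCCL-backed implementation; the function name and chunk layout here are illustrative assumptions):

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: `grads` is one equal-length gradient
    vector per worker; returns each worker's copy of the summed vector.

    Assumes the vector length is divisible by the number of workers,
    so each worker owns exactly one chunk.
    """
    n = len(grads)
    assert len(grads[0]) % n == 0, "vector length must divide evenly"
    c = len(grads[0]) // n          # chunk size
    data = [list(g) for g in grads]  # per-worker working copies

    def chunk(k):
        return slice(k * c, (k + 1) * c)

    # Phase 1, scatter-reduce: in each of n-1 steps, worker i sends one
    # chunk to its ring neighbor (i+1), which adds it in. Afterwards,
    # worker i holds the fully summed chunk (i+1) % n.
    for step in range(n - 1):
        snap = [list(d) for d in data]  # values at the start of the step
        for i in range(n):
            send = (i - step) % n
            dst = (i + 1) % n
            s = chunk(send)
            data[dst][s] = [a + b for a, b in zip(data[dst][s], snap[i][s])]

    # Phase 2, all-gather: the fully reduced chunks travel once more
    # around the ring, overwriting stale chunks, so every worker ends
    # with the complete summed vector.
    for step in range(n - 1):
        snap = [list(d) for d in data]
        for i in range(n):
            send = (i + 1 - step) % n
            dst = (i + 1) % n
            s = chunk(send)
            data[dst][s] = snap[i][s]

    return data
```

Each worker sends and receives 2(n-1) chunks of size L/n regardless of n, which is why this pattern is bandwidth-optimal and scales to many GPUs; Horovod pairs it with a few-line API change (wrapping the optimizer) rather than a rewrite of the model-building code.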

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

    cs.DC 2026-05 unverdicted novelty 7.0

    NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

  2. Ring Attention with Blockwise Transformers for Near-Infinite Context

    cs.CL 2023-10 unverdicted novelty 7.0

    Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

  3. ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

    cs.DC 2026-05 unverdicted novelty 6.0

    ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...

  4. ShardTensor: Domain Parallelism for Scientific Machine Learning

    cs.DC 2026-05 unverdicted novelty 6.0

    ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

  5. AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.

  6. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  7. A Physics-Informed Neural Network for Solving the Quasi-static Magnetohydrodynamic Equations

    physics.plasm-ph 2026-04 unverdicted novelty 6.0

    A PINN solves the time-dependent quasi-static MHD equations in axisymmetric tokamak geometry without training data and reproduces vertical plasma displacement seen in ground-truth simulations.

  8. Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

    cs.NI 2026-04 unverdicted novelty 6.0

    Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...

  9. Diffusion-Based Point-Cloud Generation of Heavy-Ion Events

    hep-ph 2026-04 unverdicted novelty 6.0

    A two-stage score-driven diffusion model with Point-Edge Transformer generates realistic high-multiplicity heavy-ion events as point clouds.

  10. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  11. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  12. Modulated learning for private and distributed regression with just a single sample per client device

    cs.LG 2026-05 unverdicted novelty 5.0

    Single-sample clients add one calibrated noisy perturbation to their data point and share transformed representations, allowing the server to recover unbiased gradients for private distributed regression.

  13. Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

    cs.DC 2026-04 unverdicted novelty 4.0

    Optimizations to Petastorm and Parquet data pipelines with caching and deterministic queues reduce large-scale deep learning training time by 6x while raising GPU utilization above 60% and eliminating run-to-run variance.

  14. Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance

    econ.GN 2026-05 unverdicted novelty 3.0

    The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.

  15. Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

    cs.LG 2026-05 unverdicted novelty 3.0

    This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.

  16. Adaptation of AI-accelerated CFD Simulations to the IPU platform

    cs.DC 2026-05 unverdicted novelty 3.0

    Porting AI-accelerated CFD model training to IPU-POD16 yields 34% data-feeding speedup and scales throughput to 2805 samples/s on 16 IPUs despite inter-IPU communication limits.