Horovod: fast and easy distributed deep learning in TensorFlow

2018 · cs.LG · arXiv 1802.05799

Canonical reference. 89% of citing Pith papers cite this work as background.

31 Pith papers citing it

Background 89% of classified citations

abstract

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod
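
The "ring reduction" named in the abstract can be illustrated with a small single-process simulation. This is only a sketch of the general ring all-reduce pattern, not Horovod's own code: the function name `ring_allreduce`, the chunking scheme, and the use of plain Python lists in place of GPU buffers are all assumptions made for illustration. Each simulated worker sends one chunk to its right-hand neighbour per step, first accumulating partial sums (reduce-scatter) and then circulating the finished chunks (all-gather).

```python
def ring_allreduce(worker_grads):
    """Simulate a sum all-reduce over a ring of workers.

    worker_grads: list of equal-length lists, one per simulated worker.
    Returns a new list in which every worker holds the element-wise sum.
    (Hypothetical helper for illustration; real libraries run this on
    device buffers with actual network transfers.)
    """
    n = len(worker_grads)
    size = len(worker_grads[0])
    # Split each vector into n nearly equal chunks: chunk c is [bounds[c], bounds[c+1]).
    bounds = [(i * size) // n for i in range(n + 1)]
    data = [list(g) for g in worker_grads]  # copy so the inputs stay untouched

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to worker (r + 1) % n, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first to mimic simultaneous sends.
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for k, value in enumerate(payload):
                data[dst][bounds[c] + k] += value
    # Now worker r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. In step s, worker r forwards chunk (r + 1 - s) % n,
    # and the receiver overwrites its stale copy with the reduced values.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][bounds[c]:bounds[c] + len(payload)] = payload
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, reduced in enumerate(ring_allreduce(grads)):
        print(rank, reduced)
```

With n workers the simulation takes 2(n - 1) steps, and in each step every worker sends a single chunk of roughly 1/n of the vector. The data sent per worker therefore stays close to twice the vector size however many workers join the ring, which is the usual argument for this pattern's bandwidth efficiency. Dividing the summed result by n would give an average rather than a sum.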

citation-role summary

background 8 baseline 1

citation-polarity summary

background 8 baseline 1

representative citing papers

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials

cs.DC · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

cs.LG · 2025-07-09 · unverdicted · novelty 7.0

A single global merge at the final step of decentralized SGD matches the convergence rate of parallel SGD while improving test accuracy under high data heterogeneity.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

Ring Attention with Blockwise Transformers for Near-Infinite Context

cs.CL · 2023-10-03 · unverdicted · novelty 7.0

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

Exploiting Multicast for Accelerating Collective Communication

cs.DC · 2026-05-21 · unverdicted · novelty 6.0

MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

cs.DC · 2026-05-18 · unverdicted · novelty 6.0

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

cs.DC · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training

cs.DC · 2025-09-25 · conditional · novelty 6.0

InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

cs.DC · 2025-08-29 · unverdicted · novelty 6.0

Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

cs.DC · 2025-04-14 · unverdicted · novelty 6.0

MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

cs.LG · 2024-10-26 · unverdicted · novelty 6.0

Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.

ShardTensor: Domain Parallelism for Scientific Machine Learning

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.

A Physics-Informed Neural Network for Solving the Quasi-static Magnetohydrodynamic Equations

physics.plasm-ph · 2026-04-22 · unverdicted · novelty 6.0

A PINN solves the time-dependent quasi-static MHD equations in axisymmetric tokamak geometry without training data and reproduces vertical plasma displacement seen in ground-truth simulations.

Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

cs.NI · 2026-04-18 · unverdicted · novelty 6.0

Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.

Diffusion-Based Point-Cloud Generation of Heavy-Ion Events

hep-ph · 2026-04-07 · unverdicted · novelty 6.0

A two-stage score-driven diffusion model with Point-Edge Transformer generates realistic high-multiplicity heavy-ion events as point clouds.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

cs.DC · 2026-05-20 · unverdicted · novelty 5.0

DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.

DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting

cs.DC · 2026-02-18 · conditional · novelty 5.0

DistributedEstimator demonstrates that circuit cutting preserves test accuracy and robustness in QNN training on Iris and MNIST while revealing that classical reconstruction dominates runtime and exponential subcircuit growth limits scaling.

On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers

cs.NI · 2025-10-22 · unverdicted · novelty 5.0

The paper analyzes the PerfBound power-saving mechanism for Ethernet-based interconnects, identifies weaknesses in dynamic power-down methods, and proposes an enhancement that improves energy reduction with minimal performance penalty.

Entanglement and Bell Nonlocality in $\tau^+ \tau^-$ at the LHC using Machine Learning for Neutrino Reconstruction

hep-ph · 2025-04-02 · unverdicted · novelty 5.0

Simulations of pp to tau+ tau- at the LHC with ML neutrino reconstruction show Bell nonlocality above 5 sigma, proposing tau pairs as a new benchmark system for quantum information studies.

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

cs.DC · 2023-03-09 · unverdicted · novelty 5.0

Cloudless-Training proposes a two-layer serverless framework with elastic scheduling and two new synchronization strategies (ASGD-GA and inter-PS model averaging) that reports 9.2-24% cost reduction and up to 1.7x speedup for geo-distributed PS-based ML training.

citing papers explorer

Showing 31 of 31 citing papers.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation cs.LG · 2026-05-18 · unverdicted · none · ref 126 · internal anchor
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials cs.DC · 2026-05-18 · unverdicted · none · ref 27 · 2 links · internal anchor
JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning cs.LG · 2025-07-09 · unverdicted · none · ref 76 · internal anchor
A single global merge at the final step of decentralized SGD matches the convergence rate of parallel SGD while improving test accuracy under high data heterogeneity.
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding cs.DC · 2026-05-12 · unverdicted · none · ref 1
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
Ring Attention with Blockwise Transformers for Near-Infinite Context cs.CL · 2023-10-03 · unverdicted · none · ref 36
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
Exploiting Multicast for Accelerating Collective Communication cs.DC · 2026-05-21 · unverdicted · none · ref 34 · internal anchor
MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability cs.DC · 2026-05-18 · unverdicted · none · ref 47 · internal anchor
RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload cs.DC · 2026-05-11 · unverdicted · none · ref 30 · 2 links · internal anchor
ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL cs.DC · 2026-05-07 · unverdicted · none · ref 55 · 2 links · internal anchor
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training cs.DC · 2025-09-25 · conditional · none · ref 37 · internal anchor
InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.
Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection cs.DC · 2025-08-29 · unverdicted · none · ref 14 · internal anchor
Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training cs.DC · 2025-04-14 · unverdicted · none · ref 60 · internal anchor
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading cs.LG · 2024-10-26 · unverdicted · none · ref 32 · internal anchor
Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
ShardTensor: Domain Parallelism for Scientific Machine Learning cs.DC · 2026-05-11 · unverdicted · none · ref 51
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery cs.LG · 2026-05-11 · unverdicted · none · ref 58
AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.
A Physics-Informed Neural Network for Solving the Quasi-static Magnetohydrodynamic Equations physics.plasm-ph · 2026-04-22 · unverdicted · none · ref 37
A PINN solves the time-dependent quasi-static MHD equations in axisymmetric tokamak geometry without training data and reproduces vertical plasma displacement seen in ground-truth simulations.
Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations cs.NI · 2026-04-18 · unverdicted · none · ref 73
Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.
Diffusion-Based Point-Cloud Generation of Heavy-Ion Events hep-ph · 2026-04-07 · unverdicted · none · ref 11
A two-stage score-driven diffusion model with Point-Edge Transformer generates realistic high-multiplicity heavy-ion events as point clouds.
HybridFlow: A Flexible and Efficient RLHF Framework cs.LG · 2024-09-28 · unverdicted · none · ref 80
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling cs.DC · 2026-05-20 · unverdicted · none · ref 10 · internal anchor
DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting cs.DC · 2026-02-18 · conditional · none · ref 8 · internal anchor
DistributedEstimator demonstrates that circuit cutting preserves test accuracy and robustness in QNN training on Iris and MNIST while revealing that classical reconstruction dominates runtime and exponential subcircuit growth limits scaling.
On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers cs.NI · 2025-10-22 · unverdicted · none · ref 29 · internal anchor
The paper analyzes the PerfBound power-saving mechanism for Ethernet-based interconnects, identifies weaknesses in dynamic power-down methods, and proposes an enhancement that improves energy reduction with minimal performance penalty.
Entanglement and Bell Nonlocality in $\tau^+ \tau^-$ at the LHC using Machine Learning for Neutrino Reconstruction hep-ph · 2025-04-02 · unverdicted · none · ref 75 · internal anchor
Simulations of pp to tau+ tau- at the LHC with ML neutrino reconstruction show Bell nonlocality above 5 sigma, proposing tau pairs as a new benchmark system for quantum information studies.
Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training cs.DC · 2023-03-09 · unverdicted · none · ref 19 · internal anchor
Cloudless-Training proposes a two-layer serverless framework with elastic scheduling and two new synchronization strategies (ASGD-GA and inter-PS model averaging) that reports 9.2-24% cost reduction and up to 1.7x speedup for geo-distributed PS-based ML training.
PyTorch Distributed: Experiences on Accelerating Data Parallel Training cs.DC · 2020-06-28 · accept · none · ref 43 · internal anchor
PyTorch distributed data parallel attains near-linear scalability on 256 GPUs through gradient bucketing, computation-communication overlap, and selective synchronization skipping.
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP cs.DC · 2026-05-08 · unverdicted · none · ref 45
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale cs.DC · 2026-04-23 · unverdicted · none · ref 2
Optimizations to Petastorm and Parquet data pipelines with caching and deterministic queues reduce large-scale deep learning training time by 6x while raising GPU utilization above 60% and eliminating run-to-run variance.
Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance econ.GN · 2026-05-14 · unverdicted · none · ref 13 · internal anchor
The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.
Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers cs.LG · 2026-05-09 · unverdicted · none · ref 38
This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.
Adaptation of AI-accelerated CFD Simulations to the IPU platform cs.DC · 2026-05-01 · unverdicted · none · ref 15
Porting AI-accelerated CFD model training to IPU-POD16 yields 34% data-feeding speedup and scales throughput to 2805 samples/s on 16 IPUs despite inter-IPU communication limits.
Modulated learning for private and distributed regression with just a single sample per client device cs.LG · 2026-05-08 · unreviewed · ref 29
