Horovod: fast and easy distributed deep learning in TensorFlow

2018 · cs.LG · arXiv 1802.05799

Canonical reference. 89% of citing Pith papers cite this work as background.

33 Pith papers citing it

Background 89% of classified citations

abstract

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod
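The ring reduction mentioned in the abstract can be illustrated with a small single-process simulation. The sketch below is not Horovod's code; it only models the general ring-allreduce pattern (a scatter-reduce pass followed by an allgather pass) on plain Python lists. The names `ring_allreduce` and `worker_data` are hypothetical and chosen for illustration, and real implementations exchange messages between processes or GPUs rather than between list entries.

```python
def ring_allreduce(worker_data):
    """Simulate a sum-allreduce over a logical ring of workers.

    worker_data: list of equal-length numeric lists, one per simulated worker.
    Returns a new list of lists in which every worker holds the elementwise sum.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Split each vector into n contiguous chunks (sizes may differ by one).
    bounds = [(k * length) // n for k in range(n + 1)]
    bufs = [list(v) for v in worker_data]  # copy so inputs stay untouched

    def exchange(step, offset, combine):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        messages = []
        for rank in range(n):
            chunk = (rank + offset - step) % n
            lo, hi = bounds[chunk], bounds[chunk + 1]
            messages.append((lo, bufs[rank][lo:hi]))
        # Each worker sends only to its right-hand neighbour on the ring.
        for rank in range(n):
            dst = (rank + 1) % n
            lo, payload = messages[rank]
            for j, value in enumerate(payload):
                bufs[dst][lo + j] = combine(bufs[dst][lo + j], value)

    # Phase 1, scatter-reduce: after n - 1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(step, 0, lambda mine, received: mine + received)

    # Phase 2, allgather: circulate the finished chunks, overwriting stale data.
    for step in range(n - 1):
        exchange(step, 1, lambda mine, received: received)

    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, result in enumerate(ring_allreduce(grads)):
        print(rank, result)
```

In this pattern each worker sends 2(N - 1) chunks of roughly 1/N of the vector, so per-worker traffic stays close to twice the vector size as the number of workers N grows.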

citation-role summary

background: 8 · baseline: 1

citation-polarity summary

background: 8 · baseline: 1

representative citing papers

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials

cs.DC · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

cs.LG · 2025-07-09 · unverdicted · novelty 7.0

A single global merge at the final step of decentralized SGD matches the convergence rate of parallel SGD while improving test accuracy under high data heterogeneity.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

Ring Attention with Blockwise Transformers for Near-Infinite Context

cs.CL · 2023-10-03 · unverdicted · novelty 7.0

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

Understanding and Reducing Metadata-Driven Host Overheads in Sampling-Based GNN Training

cs.DC · 2026-05-28 · unverdicted · novelty 6.0

ZEROGNN removes the host from the metadata-driven control loop in sampling-based GNN training, restoring CUDA Graph replayability and delivering up to 5.28x end-to-end speedup with near-100% GPU execution fraction.

Throughput-Optimized Networks at Scale

cs.NI · 2026-05-27 · unverdicted · novelty 6.0

TONS uses linear optimization and heuristics to synthesize deadlock-free network topologies and routing for datacenter AI training, reporting 2.1x and 1.6x geometric mean speedups over best TPU torus variants for uniform random and all-to-all traffic in simulation.

Exploiting Multicast for Accelerating Collective Communication

cs.DC · 2026-05-21 · unverdicted · novelty 6.0

MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

cs.DC · 2026-05-18 · unverdicted · novelty 6.0

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

cs.DC · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23x higher effective throughput than checkpoint-restart.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training

cs.DC · 2025-09-25 · conditional · novelty 6.0

InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

cs.DC · 2025-08-29 · unverdicted · novelty 6.0

Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

cs.DC · 2025-04-14 · unverdicted · novelty 6.0

MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

cs.LG · 2024-10-26 · unverdicted · novelty 6.0

Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.

ShardTensor: Domain Parallelism for Scientific Machine Learning

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.

Modulated learning for private and distributed regression with just a single sample per client device

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Introduces modulated learning for private distributed regression allowing one sample per client via calibrated noise injection on samples and aggregation of transformed representations to achieve unbiased gradients in expectation.

A Physics-Informed Neural Network for Solving the Quasi-static Magnetohydrodynamic Equations

physics.plasm-ph · 2026-04-22 · unverdicted · novelty 6.0

A PINN solves the time-dependent quasi-static MHD equations in axisymmetric tokamak geometry without training data and reproduces vertical plasma displacement seen in ground-truth simulations.

Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

cs.NI · 2026-04-18 · unverdicted · novelty 6.0

Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.

Diffusion-Based Point-Cloud Generation of Heavy-Ion Events

hep-ph · 2026-04-07 · unverdicted · novelty 6.0

A two-stage score-driven diffusion model with Point-Edge Transformer generates realistic high-multiplicity heavy-ion events as point clouds.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

cs.DC · 2026-05-20 · unverdicted · novelty 5.0

DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.

On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers

cs.NI · 2025-10-22 · unverdicted · novelty 5.0

The paper analyzes the PerfBound power-saving mechanism for Ethernet-based interconnects, identifies weaknesses in dynamic power-down methods, and proposes an enhancement that improves energy reduction with minimal performance penalty.

citing papers explorer

Ring Attention with Blockwise Transformers for Near-Infinite Context

cs.CL · 2023-10-03 · unverdicted · none · ref 36

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

cs.DC · 2023-03-09 · unverdicted · none · ref 19 · internal anchor

Cloudless-Training proposes a two-layer serverless framework with elastic scheduling and two new synchronization strategies (ASGD-GA and inter-PS model averaging) that reports 9.2-24% cost reduction and up to 1.7x speedup for geo-distributed PS-based ML training.
