Gossipgrad: Scal- able deep learning using gossip communication based asynchronous gradient descent

Jeﬀ Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, Vinay Amatya · 2018 · cs.DC · arXiv 1803.05880

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

In this paper, we present GossipGraD - a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from {\Theta}(log(p)) for p compute nodes in well-studied SGD to O(1), 2) model diffusion such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners for facilitating direct diffusion of gradients, 4) asynchronous distributed shuffle of samples during the feedforward phase in SGD to prevent over-fitting, 5) asynchronous communication of gradients for further reducing the communication cost of SGD and GossipGraD. We implement GossipGraD for GPU and CPU clusters and use NVIDIA GPUs (Pascal P100) connected with InfiniBand, and Intel Knights Landing (KNL) connected with Aries network. We evaluate GossipGraD using well-studied dataset ImageNet-1K (~250GB), and widely studied neural network topologies such as GoogLeNet and ResNet50 (current winner of ImageNet Large Scale Visualization Research Challenge (ILSVRC)). Our performance evaluation using both KNL and Pascal GPUs indicates that GossipGraD can achieve perfect efficiency for these datasets and their associated neural network topologies. Specifically, for ResNet50, GossipGraD is able to achieve ~100% compute efficiency using 128 NVIDIA Pascal P100 GPUs - while matching the top-1 classification accuracy published in literature.

representative citing papers

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.

SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

math.OC · 2026-04-27 · unverdicted · novelty 6.0

SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is required to avoid non-stationary fixed points and that local-polarize-then-average is

citing papers explorer

Showing 2 of 2 citing papers.

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training cs.DC · 2026-05-15 · unverdicted · none · ref 25 · internal anchor
Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.
SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon math.OC · 2026-04-27 · unverdicted · none · ref 5
SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is required to avoid non-stationary fixed points and that local-polarize-then-average is

Gossipgrad: Scal- able deep learning using gossip communication based asynchronous gradient descent

fields

years

verdicts

representative citing papers

citing papers explorer