arxiv: 2501.07526 · v2 · pith:T4XGT3RVnew · submitted 2025-01-13 · 💻 cs.DC · stat.ML

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

classification: cs.DC, stat.ML
keywords: communication, fedavg, parallel, algorithms, distributed-memory, performance, step, algorithm

Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters, where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D $s$-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) that attains a continuous performance trade-off between the two baseline algorithms. We present a theoretical analysis that shows the convergence, computation, communication, and memory trade-offs among $s$-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of $5.3\times$ over $s$-step SGD and speedups of up to $121\times$ over FedAvg when solving binary classification tasks with the convex logistic regression model on datasets obtained from the LIBSVM repository.
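
To make the FedAvg baseline named in the abstract concrete, the following is a minimal serial sketch, not the paper's C++/MPI implementation and not HybridSGD. It assumes the usual FedAvg pattern: each worker runs a few local mini-batch SGD steps on its own rows of the data, and the workers then average their weight vectors, which is the only point where communication would occur. All names here (`fedavg`, `local_sgd`, `make_data`), the synthetic dataset, and the hyperparameters are hypothetical choices for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(w, X, y):
    # Mean logistic loss for labels in {0, 1}, in a stable form.
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
    return total / len(X)


def local_sgd(w, X, y, steps, batch, lr, rng):
    # Run `steps` mini-batch SGD updates on one worker's local rows.
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for _ in range(batch):
            i = rng.randrange(len(X))
            z = sum(wj * xj for wj, xj in zip(w, X[i]))
            err = sigmoid(z) - y[i]
            for j, xj in enumerate(X[i]):
                grad[j] += err * xj / batch
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def fedavg(X, y, workers=4, rounds=20, local_steps=5, batch=8, lr=0.5, seed=0):
    # Assumed 1D row partition: worker k owns rows k, k + workers, ...
    rng = random.Random(seed)
    shards = [(X[k::workers], y[k::workers]) for k in range(workers)]
    w = [0.0] * len(X[0])
    for _ in range(rounds):
        # Each simulated worker starts from the shared weights.
        local = [local_sgd(w, Xk, yk, local_steps, batch, lr, rng)
                 for Xk, yk in shards]
        # Averaging step: stands in for an all-reduce across workers.
        w = [sum(col) / workers for col in zip(*local)]
    return w


def make_data(n=200, seed=1):
    # Hypothetical separable toy data with a constant bias feature.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        X.append([x1, x2, 1.0])
        y.append(1 if x1 + x2 > 0 else 0)
    return X, y
```

Calling `fedavg(*make_data())` returns a weight vector whose loss can be compared against the zero-weight starting point with `logistic_loss`. In this sketch, raising `local_steps` reduces how often the averaging step runs per local update, which is the communication saving that FedAvg-style methods trade against convergence.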

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

    cs.DC 2026-06 unverdicted novelty 6.0

    Mixed-precision CA-SGD for GLMs on A100 GPUs matches FP32 loss within 0.5% while delivering 5.1-6.8x speedup via a nine-choice finite-precision error recipe.