Deep gradient compression: Reducing the communication bandwidth for distributed training

2017 · arXiv 1712.01887

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 2

citation-polarity summary

background: 2 · use method: 2

representative citing papers

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.

Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

math.OC · 2026-05-08 · unverdicted · novelty 7.0

Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

cs.LG · 2025-09-03 · unverdicted · novelty 7.0

DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.

Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.

SignMuon: Communication-Efficient Distributed Muon Optimization

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong empirical results on CIFAR and nanoGPT.

FedSQ: Optimized Weight Averaging via Fixed Gating

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

FedSQ stabilizes federated weight averaging under heterogeneous data by fixing binary gating masks derived from a pretrained model's structure while optimizing only quantitative parameters.

Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks

cs.LG · 2026-01-30 · unverdicted · novelty 6.0

Fed-Listing infers client label proportions in FedGNNs from final-layer gradients, outperforming baselines on four datasets and three architectures even in non-i.i.d. settings.

Federated Learning with Non-IID Data

cs.LG · 2018-06-02 · conditional · novelty 6.0

Non-IID data causes up to 55% accuracy loss in federated learning due to weight divergence measured by earth mover's distance; 5% globally shared data recovers 30% accuracy on CIFAR-10.

Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

math.OC · 2026-05-09 · unverdicted · novelty 5.0

Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training

cs.LG · 2026-05-03 · unverdicted · novelty 5.0

DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts while maintaining comparable test accuracy.

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

cs.DC · 2026-04-27 · unverdicted · novelty 5.0

TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87x throughput gains on GPT and Qwen models with near-lossless accuracy.

Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy

cs.CV · 2026-04-25 · unverdicted · novelty 5.0

Adaptive bit-length schedulers plus Laplacian DP in non-IID FL reduce communicated data by up to 52.64% on MNIST and 45% on CIFAR-10 while keeping competitive accuracy and privacy.

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

cs.DC · 2023-03-09 · unverdicted · novelty 5.0

Cloudless-Training proposes a two-layer serverless framework with elastic scheduling and two new synchronization strategies (ASGD-GA and inter-PS model averaging) that reports 9.2-24% cost reduction and up to 1.7x speedup for geo-distributed PS-based ML training.

Leveraging Kernel Symmetry for Joint Compression and Error Mitigation in Edge Model Transfer

eess.SP · 2026-04-19 · unverdicted · novelty 4.0

A DoF codec exploiting kernel symmetries compresses neural models for noisy channels and projects received weights onto the symmetry subspace to mitigate errors, outperforming pruning on MNIST and CIFAR-10.

A Survey of Personalized Federated Foundation Models for Privacy-Preserving Recommendation

cs.LG · 2025-06-13 · unverdicted · novelty 3.0

A survey of personalization techniques and foundation model adaptations in federated settings for privacy-preserving recommendations, emphasizing their architectural intersection.

citing papers explorer

Showing 17 of 17 citing papers.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging · cs.LG · 2026-05-20 · unverdicted · none · ref 164
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method · cs.LG · 2026-05-18 · unverdicted · none · ref 162
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits · math.OC · 2026-05-08 · unverdicted · none · ref 80
DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling · cs.LG · 2025-09-03 · unverdicted · none · ref 29
Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics · cs.LG · 2026-05-21 · unverdicted · none · ref 102
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity · cs.LG · 2026-05-13 · unverdicted · none · ref 267
SignMuon: Communication-Efficient Distributed Muon Optimization · cs.LG · 2026-05-04 · unverdicted · none · ref 19
FedSQ: Optimized Weight Averaging via Fixed Gating · cs.LG · 2026-04-03 · unverdicted · none · ref 8
Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks · cs.LG · 2026-01-30 · unverdicted · none · ref 49
Federated Learning with Non-IID Data · cs.LG · 2018-06-02 · conditional · none · ref 11
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction · math.OC · 2026-05-09 · unverdicted · none · ref 160
DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training · cs.LG · 2026-05-03 · unverdicted · none · ref 8
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training · cs.DC · 2026-04-27 · unverdicted · none · ref 32
Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy · cs.CV · 2026-04-25 · unverdicted · none · ref 13
Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training · cs.DC · 2023-03-09 · unverdicted · none · ref 8
Leveraging Kernel Symmetry for Joint Compression and Error Mitigation in Edge Model Transfer · eess.SP · 2026-04-19 · unverdicted · none · ref 10
A Survey of Personalized Federated Foundation Models for Privacy-Preserving Recommendation · cs.LG · 2025-06-13 · unverdicted · none · ref 35
