LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
hub
Deep gradient compression: Reducing the communication bandwidth for distributed training
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong empirical results on CIFAR and nanoGPT.
FedSQ stabilizes federated weight averaging under heterogeneous data by fixing binary gating masks derived from a pretrained model's structure while optimizing only quantitative parameters.
Fed-Listing infers client label proportions in FedGNNs from final-layer gradients, outperforming baselines on four datasets and three architectures even in non-i.i.d. settings.
Non-IID data causes up to 55% accuracy loss in federated learning due to weight divergence measured by earth mover's distance; 5% globally shared data recovers 30% accuracy on CIFAR-10.
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts while maintaining comparable test accuracy.
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
Adaptive bit-length schedulers plus Laplacian DP in non-IID FL reduce communicated data by up to 52.64% on MNIST and 45% on CIFAR-10 while keeping competitive accuracy and privacy.
Cloudless-Training proposes a two-layer serverless framework with elastic scheduling and two new synchronization strategies (ASGD-GA and inter-PS model averaging) that reports 9.2-24% cost reduction and up to 1.7x speedup for geo-distributed PS-based ML training.
A DoF codec exploiting kernel symmetries compresses neural models for noisy channels and projects received weights onto the symmetry subspace to mitigate errors, outperforming pruning on MNIST and CIFAR-10.
A survey of personalization techniques and foundation model adaptations in federated settings for privacy-preserving recommendations, emphasizing their architectural intersection.
citing papers explorer
-
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
-
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
-
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
-
DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.
-
Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
-
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
-
SignMuon: Communication-Efficient Distributed Muon Optimization
SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong empirical results on CIFAR and nanoGPT.
-
FedSQ: Optimized Weight Averaging via Fixed Gating
FedSQ stabilizes federated weight averaging under heterogeneous data by fixing binary gating masks derived from a pretrained model's structure while optimizing only quantitative parameters.
-
Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks
Fed-Listing infers client label proportions in FedGNNs from final-layer gradients, outperforming baselines on four datasets and three architectures even in non-i.i.d. settings.
-
Federated Learning with Non-IID Data
Non-IID data causes up to 55% accuracy loss in federated learning due to weight divergence measured by earth mover's distance; 5% globally shared data recovers 30% accuracy on CIFAR-10.
-
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
-
DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts while maintaining comparable test accuracy.
-
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
-
Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy
Adaptive bit-length schedulers plus Laplacian DP in non-IID FL reduce communicated data by up to 52.64% on MNIST and 45% on CIFAR-10 while keeping competitive accuracy and privacy.
-
Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training
Cloudless-Training proposes a two-layer serverless framework with elastic scheduling and two new synchronization strategies (ASGD-GA and inter-PS model averaging) that reports 9.2-24% cost reduction and up to 1.7x speedup for geo-distributed PS-based ML training.
-
Leveraging Kernel Symmetry for Joint Compression and Error Mitigation in Edge Model Transfer
A DoF codec exploiting kernel symmetries compresses neural models for noisy channels and projects received weights onto the symmetry subspace to mitigate errors, outperforming pruning on MNIST and CIFAR-10.
-
A Survey of Personalized Federated Foundation Models for Privacy-Preserving Recommendation
A survey of personalization techniques and foundation model adaptations in federated settings for privacy-preserving recommendations, emphasizing their architectural intersection.