Don’t use large mini-batches, use local SGD
7 Pith papers cite this work.
citing papers explorer
- Demystifying Pipeline Parallelism: First Theory for PipeDream
  Introduces Randomized PipeDream abstraction yielding first nonconvex convergence bound for PipeDream and proves delay scales as S squared for S stages.
- Unveiling High-Probability Generalization in Decentralized SGD
  High-probability generalization bounds for D-SGD are derived at the optimal rate O(1/sqrt(mn) log(1/δ)) via pointwise uniform stability across convex and non-convex settings.
- Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning
  Local MixVR achieves communication complexity scaling only with number of workers M, independent of total samples N, and outperforms Minibatch Accelerated SGD when M is smaller than order N to the 1/4.
- Stability and Generalization for Decentralized Markov SGD
  Decentralized SGD and SGDA under Markovian sampling admit non-asymptotic generalization bounds that incorporate network topology, Markov mixing rates, and primal-dual dynamics.
- Adaptive Federated Optimization
  Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
- Collaborative Machine Learning at the Wireless Edge with Blind Transmitters
  Analog over-the-air DSGD scheme in which a multi-antenna PS compensates for blind transmitters so that fading and noise vanish as antenna count grows.
- Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization
  FedInit uses reverse personalized initialization in FL to reduce client drift effects, showing via excess risk that inconsistency impacts generalization error more than optimization error.