Federated Learning with Non-IID Data

Damon Civin; Liangzhen Lai; Meng Li; Naveen Suda; Vikas Chandra; Yue Zhao

arxiv: 1806.00582 · v2 · pith:ILK7VY6Jnew · submitted 2018-06-02 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-16 10:18 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords: federated learning, non-IID data, data sharing, weight divergence, earth mover's distance, CIFAR-10, accuracy recovery

The pith

A small globally shared data subset recovers up to 30% accuracy lost to non-IID data in federated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Federated learning trains a shared model across edge devices while keeping data local, but accuracy falls sharply when each device's data follows a different distribution. In highly skewed cases where every device sees only one class, the drop reaches up to 55%. The paper shows that this loss can be explained by weight divergence, which it quantifies with the earth mover's distance between each device's class distribution and the population distribution. Distributing a small shared data subset to all devices during training pulls updates toward the overall distribution and restores much of the lost performance.

Core claim

When each client trains only on samples from a single class, federated averaging loses up to 55% accuracy relative to the IID case. The divergence of local weight vectors grows with the earth mover's distance between a device's class distribution and the population distribution. Sharing a global data subset amounting to only 5% of the training data raises CIFAR-10 accuracy by about 30%.
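To make the distance in this claim concrete, here is a minimal sketch in plain Python. It assumes the distance between two class distributions is taken as the sum of absolute per-class probability differences; this page does not spell out the paper's exact formula, so treat that choice, and the helper names `class_distribution` and `label_distance`, as illustrative assumptions.

```python
def class_distribution(labels, num_classes):
    """Return the empirical class probabilities for a list of integer labels."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    total = len(labels)
    return [c / total for c in counts]


def label_distance(p, q):
    """Sum of absolute per-class probability differences (assumed distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


NUM_CLASSES = 10
# Balanced population: every class equally likely (hypothetical setup).
population = [1.0 / NUM_CLASSES] * NUM_CLASSES

# A client that only ever sees class 3, and a client with balanced data.
one_class_client = class_distribution([3] * 500, NUM_CLASSES)
iid_client = class_distribution(list(range(NUM_CLASSES)) * 50, NUM_CLASSES)

skewed = label_distance(one_class_client, population)
balanced = label_distance(iid_client, population)
print(f"one-class client: {skewed:.2f}")
print(f"IID client:       {balanced:.2f}")
```

Under these assumptions the one-class client sits at a distance of 1.8 from the balanced population, while the IID client sits at 0, which is the ordering the claim relies on.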

What carries the argument

A small globally shared data subset that every client mixes with its local non-IID examples during training to reduce weight divergence.

Load-bearing premise

That a small globally shared data subset can be created and distributed without violating the privacy or regulatory constraints that motivated federated learning in the first place.

What would settle it

Re-training the CIFAR-10 models with exactly 0% versus 5% shared data and checking whether the accuracy difference reaches approximately 30%.

Original abstract

Federated learning enables resource-constrained edge compute devices, such as mobile phones and IoT devices, to learn a shared model for prediction, while keeping the training data local. This decentralized approach to train models provides privacy, security, regulatory and economic benefits. In this work, we focus on the statistical challenge of federated learning when local data is non-IID. We first show that the accuracy of federated learning reduces significantly, by up to 55% for neural networks trained for highly skewed non-IID data, where each client device trains only on a single class of data. We further show that this accuracy reduction can be explained by the weight divergence, which can be quantified by the earth mover's distance (EMD) between the distribution over classes on each device and the population distribution. As a solution, we propose a strategy to improve training on non-IID data by creating a small subset of data which is globally shared between all the edge devices. Experiments show that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.
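The two quantities the abstract leans on, the averaged model and its weight divergence, can be sketched as follows. The averaging step weights each client by its sample count, as in federated averaging. The normalized form of the divergence (distance to a reference weight vector divided by that vector's norm) is an assumption made for this sketch, and the toy weights, sizes, and function names are hypothetical.

```python
import math


def fedavg(client_weights, client_sizes):
    """Average client weight vectors, weighting each by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def relative_divergence(w, w_ref):
    """||w - w_ref|| / ||w_ref||: an assumed way to score weight divergence."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_ref)))
    return diff / math.sqrt(sum(b ** 2 for b in w_ref))


# Hypothetical toy weights: the reference stands in for a centrally trained model.
reference = [1.0, 2.0, 2.0]
clients = [[1.0, 2.0, 2.0], [4.0, 2.0, 2.0]]
sizes = [100, 300]

global_w = fedavg(clients, sizes)
divergence = relative_divergence(global_w, reference)
print(global_w, divergence)
```

In this toy case the larger client drifts on the first coordinate, so the averaged model is pulled away from the reference. With real models the vectors would be the flattened parameters of each network.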

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that non-IID data distributions in federated learning cause large accuracy drops (up to 55% for neural networks when each client holds data from only one class), that this degradation can be explained by weight divergence quantified via Earth Mover's Distance (EMD) between local and population class distributions, and that sharing a small (5%) globally representative data subset across clients recovers up to 30% accuracy on CIFAR-10.

Significance. If the mitigation holds under realistic constraints, the work supplies useful early empirical baselines on the severity of non-IID effects in federated learning and a simple heuristic for partial recovery. The EMD diagnostic is interpretable and the reported gains on standard vision benchmarks are practically relevant, though the privacy feasibility of the shared-subset construction remains the central open question.

major comments (1)

Abstract and proposed strategy: the reported 30% accuracy gain on CIFAR-10 is obtained only after a 5% globally shared subset has been sampled from the full centralized training distribution and broadcast to every client. No procedure is given for constructing an equivalent representative subset when data never leaves the clients, directly contradicting the privacy premise stated in the introduction.

minor comments (1)

Experiments section: the abstract states clear empirical drops and recoveries but provides no error bars, number of random seeds, or ablation details on the EMD-weight-divergence causal link.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting an important practical limitation in our proposed strategy. We address the concern directly below and will revise the manuscript to clarify the assumptions and limitations.

Referee: Abstract and proposed strategy: the reported 30% accuracy gain on CIFAR-10 is obtained only after a 5% globally shared subset has been sampled from the full centralized training distribution and broadcast to every client. No procedure is given for constructing an equivalent representative subset when data never leaves the clients, directly contradicting the privacy premise stated in the introduction.

Authors: We agree that the experimental construction of the 5% shared subset relies on sampling from the full centralized training distribution, which assumes a form of global access not available under strict per-client data isolation. This is a genuine limitation of the current presentation. In the revised manuscript we will (1) explicitly qualify the abstract and introduction to state that the shared subset is created from a centralized view in our experiments, (2) add a dedicated limitations paragraph discussing how the subset could be obtained in practice (e.g., via a small public dataset drawn from a similar distribution, synthetic data, or a trusted curator), and (3) note that the approach therefore represents a practical heuristic that relaxes the strictest privacy model rather than a fully decentralized solution. These changes will remove any implication of contradiction with the privacy premise.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

The paper's core claims rest on experimental measurements of accuracy under non-IID partitions and the effect of adding a shared data subset. No derivation chain, equations, or 'predictions' are presented that reduce by construction to fitted parameters or self-referential definitions. The weight divergence explanation invokes EMD as a standard metric between class distributions and the global one, without deriving EMD from the target result. No self-citations are load-bearing for uniqueness or ansatz. The reported accuracy gains are direct measurements conditional on the experimental protocol, not tautological outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard supervised learning assumptions and the empirical observation that a small shared subset is feasible; no new mathematical axioms or invented entities are introduced.

free parameters (1)

shared_data_fraction
The 5% figure is chosen experimentally to balance accuracy gain against communication cost; its exact value is not derived from first principles.
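As a rough illustration of what this parameter controls, the sketch below draws a shared subset from a centralized pool and appends it to a one-class client's data. The pool size (50,000 examples, 10 balanced classes), the single-class client, and all function names are hypothetical choices for the example. Sampling from a centralized pool is itself the assumption that the referee report questions.

```python
import random


def sample_shared_subset(pool, fraction, seed=0):
    """Draw a uniform random subset holding `fraction` of a centralized pool."""
    rng = random.Random(seed)
    k = round(len(pool) * fraction)
    return rng.sample(pool, k)


def mix_with_shared(local_examples, shared_subset):
    """Return the client's training set: its own examples plus the shared ones."""
    return list(local_examples) + list(shared_subset)


def class_fractions(examples, num_classes):
    """Fraction of examples per class; each example is an (id, label) pair."""
    counts = [0] * num_classes
    for _, label in examples:
        counts[label] += 1
    return [c / len(examples) for c in counts]


NUM_CLASSES = 10
SHARED_DATA_FRACTION = 0.05  # the free parameter discussed above

# Hypothetical pool of 50,000 labelled examples, balanced across 10 classes.
pool = [(f"img{i}", i % NUM_CLASSES) for i in range(50_000)]
local = [ex for ex in pool if ex[1] == 0]  # client that only holds class 0

shared = sample_shared_subset(pool, SHARED_DATA_FRACTION)
mixed = mix_with_shared(local, shared)

before = class_fractions(local, NUM_CLASSES)
after = class_fractions(mixed, NUM_CLASSES)
print(f"class-0 share before: {before[0]:.2f}, after: {after[0]:.2f}")
```

Raising the fraction pushes each client's label mix closer to the population at the cost of broadcasting more data, which is the trade-off the ledger entry describes.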

axioms (1)

Domain assumption: local data distributions are fixed and known for the purpose of computing EMD to the global distribution.
Invoked when quantifying weight divergence; no justification or sensitivity analysis provided in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1190 out tokens · 37384 ms · 2026-05-16T10:18:46.227951+00:00 · methodology

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
cs.LG 2026-05 unverdicted novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
cs.LG 2026-05 unverdicted novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
cs.CV 2026-05 unverdicted novelty 7.0

HeroCrystal uses single-image diffusion synthesis, probabilistic federated Faster R-CNN with contrastive debiasing, and inconsistent-category integration to reach 33.4% mAP in privacy-preserving multi-camera object detection.
Byzantine-Robust Distributed SGD: A Unified Analysis and Tight Error Bounds
math.OC 2026-04 unverdicted novelty 7.0

Unified convergence rates and tight lower bounds for Byzantine-robust distributed SGD under stochasticity and general data heterogeneity, showing local momentum reduces stochastic error floors.
On the Fragility of Data Attribution When Learning Is Distributed
cs.LG 2026-05 unverdicted novelty 6.0

A single adversary in distributed training inflates its attribution value via latent optimization on synthetic batches without degrading accuracy or triggering basic defenses.
FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning
cs.LG 2026-05 unverdicted novelty 6.0

FedVSSAM mitigates flatness incompatibility in SAM-based federated learning by consistently using a variance-suppressed adjusted direction for local perturbation, descent, and global updates, with non-convex convergen...
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
cs.CV 2026-05 unverdicted novelty 6.0

ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
cs.DC 2026-04 unverdicted novelty 6.0

AW-PSP dynamically weights node sampling by real-time availability predictions and failure correlations to improve robustness, label coverage, and fairness in federated learning under correlated device failures.
HierFedCEA: Hierarchical Federated Edge Learning for Privacy-Preserving Climate Control Optimization Across Heterogeneous Controlled Environment Agriculture Facilities
eess.SY 2026-04 unverdicted novelty 6.0

HierFedCEA delivers a hierarchical federated learning framework for privacy-preserving climate control optimization across heterogeneous CEA facilities, reaching 94% of centralized performance with under 1 MB communication.
Client-Conditional Federated Learning via Local Training Data Statistics
cs.LG 2026-03 unverdicted novelty 6.0

Conditioning a global FL model on local PCA statistics of client data matches oracle cluster performance across heterogeneous settings and is robust to sparse data with zero added communication.
Practical Quantum Federated Learning for Privacy-Sensitive Healthcare: Communication Efficiency and Noise Resilience
quant-ph 2026-03 unverdicted novelty 6.0

Hybrid QFL cuts quantum transmissions from 3TNMP to {3t + 2(T-t)}NMP over T rounds while preserving near-centralized convergence and improving depolarizing-noise resilience via decentralized aggregation and Steane-code QEC.
Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks
cs.LG 2026-01 unverdicted novelty 6.0

Fed-Listing infers client label proportions in FedGNNs from final-layer gradients, outperforming baselines on four datasets and three architectures even in non-i.i.d. settings.
Task-agnostic Low-rank Residual Adaptation for Efficient Federated Continual Fine-Tuning
cs.LG 2025-05 unverdicted novelty 6.0

Fed-TaLoRA uses task-agnostic low-rank residual adaptation with post-aggregation calibration to enable efficient federated continual fine-tuning across sequential tasks under non-IID conditions.
FIRMA: FIbonacci Ring Model Aggregation for Privacy-preserving Federated Learning
cs.LG 2026-05 unverdicted novelty 5.0

FIRMA introduces Fibonacci ring aggregation protocols for server-free federated learning that maintain private heads and achieve higher accuracy than FedAvg under label skew across multiple benchmarks and heterogeneit...
Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning
cs.LG 2026-05 unverdicted novelty 5.0

Proactive client selection in federated learning via differentially private mutual information and simulated annealing to optimize Potential Federation Loss for utility and fairness.
Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning
cs.LG 2026-05 unverdicted novelty 5.0

Proposes proactive client selection via differentially private mutual information and Potential Federation Loss optimized by simulated annealing to achieve faster, fairer, and more accurate federated models than unifo...
FedSDR: Federated Self-Distillation with Rectification
cs.LG 2026-05 unverdicted novelty 5.0

FedSDR augments federated self-distillation with dual LoRA streams (local smoothing and global rectification) to produce globally aligned, factually faithful models under statistical heterogeneity.
FedSurrogate: Backdoor Defense in Federated Learning via Layer Criticality and Surrogate Replacement
cs.CR 2026-05 unverdicted novelty 5.0

FedSurrogate defends federated learning against backdoors by clustering on security-critical layers and substituting malicious updates with benign surrogates, reporting false-positive rates below 10% and attack succes...
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
math.OC 2026-05 unverdicted novelty 5.0

Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification
cs.LG 2026-05 unverdicted novelty 5.0

CLAD is a clustered federated learning framework with a dual-mode architecture for joint anomaly detection and attack classification in IoT using labeled and unlabeled data.
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
cs.CV 2026-05 unverdicted novelty 5.0

HeroCrystal achieves 33.4% mAP on cross-domain multi-camera object detection by combining one-shot diffusion-based synthetic data generation, probabilistic federated Faster R-CNN, and inconsistent-category distillatio...
FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning
cs.LG 2026-04 unverdicted novelty 5.0

FMCL performs one-shot class-aware client clustering in heterogeneous federated learning by deriving semantic signatures from foundation model embeddings and using cosine distance, yielding improved performance and st...
REVERB-FL: Server-Side Adversarial and Reserve-Enhanced Federated Learning for Robust Audio Classification
eess.AS 2025-12 unverdicted novelty 5.0

REVERB-FL uses a server-side reserve set with retraining and adversarial training to reduce poisoning effects and speed convergence in federated audio classification under non-IID data.
Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification
cs.LG 2019-09 unverdicted novelty 5.0

Non-identical data distributions degrade federated averaging accuracy on visual classification, but server momentum raises CIFAR-10 accuracy from 30.1% to 76.9% in the most skewed regimes.
Evaluating Federated Learning approaches for mammography under breast density heterogeneity
cs.LG 2026-05 unverdicted novelty 4.0

FedAvg matches centralized training accuracy on mammography data split by breast density heterogeneity, showing standard FL can handle this clinical variation without special fixes.
FedKPer: Tackling Generalization and Personalization in Medical Federated Learning via Knowledge Personalization
eess.IV 2026-05 unverdicted novelty 4.0

FedKPer improves the generalization-personalization trade-off in medical federated learning via local knowledge personalization and selective aggregation that emphasizes reliable updates.
Automating aggregation strategy selection in federated learning
cs.LG 2026-04 unverdicted novelty 4.0

A framework automates federated learning aggregation strategy selection via LLM inference in single-trial mode and genetic search in multi-trial mode, improving robustness under non-IID data.
Multi-Worker Selection based Distributed Swarm Learning for Edge IoT with Non-i.i.d. Data
cs.LG 2025-09 unverdicted novelty 4.0

Introduces M-DSL algorithm for distributed swarm learning that selects workers using a new non-i.i.d. degree metric to improve convergence and accuracy under data heterogeneity, with theoretical analysis and experimen...
Privacy-Preserving Federated Learning: Integrating Zero-Knowledge Proofs in Scalable Distributed Architectures
cs.DC 2026-05 unverdicted novelty 3.0

A hybrid federated learning architecture using zero-knowledge proofs for computation verification retains 94.2% accuracy under adversarial conditions across 1,000 nodes.
Split and Aggregation Learning for Foundation Models Over Mobile Embodied AI Network (MEAN): A Comprehensive Survey
cs.IT 2026-05 unverdicted novelty 3.0

The paper surveys split and aggregation learning for foundation models in 6G networks to improve efficiency, resource use, and data privacy in distributed AI.
Knowledge Distillation in Federated Learning: a Survey on Long Lasting Challenges and New Solutions
cs.LG 2024-06 unverdicted novelty 2.0

A survey organizing knowledge distillation techniques for addressing privacy, heterogeneity, communication, and personalization challenges in federated learning.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 29 Pith papers · 7 internal anchors

[1] Y. Zhang, N. Suda, L. Lai, and V. Chandra, “Hello Edge: Keyword spotting on microcontrollers,” arXiv preprint arXiv:1711.07128, 2017

[2] L. Lai, N. Suda, and V. Chandra, “CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs,” arXiv preprint arXiv:1801.06601, 2018

[3] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Int. Conf. on Artificial Intelligence and Statistics, 2017

[4] J. Konečný, B. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv preprint arXiv:1511.03575, 2015

[5] H. B. McMahan and D. Ramage, “Federated learning: Collaborative machine learning without centralized training data,” Google, 2017

[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, pp. 2278–2324, 1998

[7] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Technical Report, University of Toronto, 2009

[8] W. Shakespeare, “The complete works of William Shakespeare,” https://www.gutenberg.org/ebooks/100

[9] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191, ACM, 2017

[10] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016

[11] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017

[12] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011

[13] T. Tieleman and G. Hinton, “Divide the gradient by a running average of its recent magnitude. Coursera: Neural networks for machine learning,” Technical Report. Available online: https://zh.coursera.org/learn/neuralnetworks/lecture/YQHki/rmsprop-divide-the-gradient-by-a-running-average-of-its-recent-magnitude (accessed on 21 April 2017)

[14] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015

[15] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010

[16] A. Rakhlin, O. Shamir, K. Sridharan, et al., “Making gradient descent optimal for strongly convex stochastic optimization,” in ICML, Citeseer, 2012

[17] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, pp. 1223–1231, 2012

[18] S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

[19] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Advances in Neural Information Processing Systems, pp. 4427–4437, 2017

[20] P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012

[22] B. Graham, “Fractional max-pooling,” arXiv preprint arXiv:1412.6071, 2014

[23] I. J. Goodfellow, O. Vinyals, and A. M. Saxe, “Qualitatively characterizing neural network optimization problems,” arXiv preprint arXiv:1412.6544, 2014
