Federated Learning with Non-IID Data
Pith reviewed 2026-05-16 10:18 UTC · model grok-4.3
The pith
A small globally shared data subset recovers up to 30% accuracy lost to non-IID data in federated learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When each client trains only on samples from one class, federated averaging produces models whose accuracy falls by up to 55% compared with the IID case. The paper attributes this drop to the divergence of local weight vectors, which it quantifies through the earth mover's distance (EMD) between each device's class distribution and the population distribution. Introducing a globally shared data subset of about 5% of the training data raises CIFAR-10 accuracy by about 30%.
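As a rough illustration of the EMD diagnostic, the sketch below compares a client's class distribution with a population distribution. It assumes the distance is taken as the sum of absolute per-class probability differences and that the population is balanced over 10 classes; the function names and the example clients are hypothetical, not taken from the paper.

```python
from collections import Counter


def class_distribution(labels, num_classes):
    """Return the empirical class-probability vector for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def emd(client_dist, population_dist):
    """Distance between two class distributions.

    Assumption: computed as the sum of absolute per-class differences.
    """
    return sum(abs(p - q) for p, q in zip(client_dist, population_dist))


NUM_CLASSES = 10
# Assumed balanced population over 10 classes.
population = [1.0 / NUM_CLASSES] * NUM_CLASSES

# Hypothetical client that holds only class 3.
one_class_client = class_distribution([3] * 500, NUM_CLASSES)
# Hypothetical client with an equal number of examples per class.
iid_client = class_distribution(list(range(NUM_CLASSES)) * 50, NUM_CLASSES)

emd_one_class = emd(one_class_client, population)  # 0.9 + 9 * 0.1 = 1.8
emd_iid = emd(iid_client, population)              # 0.0
```

Under these assumptions the one-class client sits at the maximum distance for a balanced 10-class population, while the IID client sits at zero.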
What carries the argument
A small globally shared data subset that every client mixes with its local non-IID examples during training to reduce weight divergence.
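The mixing step can be sketched as follows. This is a minimal illustration, assuming a client that holds 5,000 examples of a single class and receives the whole shared subset of 2,500 class-balanced examples (5% of a 50,000-example training set); the helper names and the client size are hypothetical. It shows how mixing pulls the client's class distribution toward the balanced population, measured here by the sum of absolute per-class differences.

```python
import random


def mix_with_shared(local_examples, shared_examples, seed=0):
    """Return one shuffled training list containing local plus shared examples."""
    combined = list(local_examples) + list(shared_examples)
    random.Random(seed).shuffle(combined)
    return combined


def label_distribution(examples, num_classes):
    """Empirical class probabilities for a list of (example_id, label) pairs."""
    counts = [0] * num_classes
    for _, label in examples:
        counts[label] += 1
    return [c / len(examples) for c in counts]


def l1_distance(p, q):
    """Sum of absolute per-class differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


NUM_CLASSES = 10
# Hypothetical client holding 5000 examples of class 0 only.
local = [(f"local_{i}", 0) for i in range(5000)]
# Hypothetical shared subset: 250 examples per class, 2500 in total.
shared = [(f"shared_{c}_{i}", c) for c in range(NUM_CLASSES) for i in range(250)]

uniform = [1 / NUM_CLASSES] * NUM_CLASSES
before = l1_distance(label_distribution(local, NUM_CLASSES), uniform)  # 1.8
mixed = mix_with_shared(local, shared)
after = l1_distance(label_distribution(mixed, NUM_CLASSES), uniform)   # 1.2
```

In this toy setting the distance drops from 1.8 to 1.2; the actual effect depends on how much shared data each client receives relative to its local data.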
Load-bearing premise
That a small globally shared data subset can be created and distributed without violating the privacy or regulatory constraints that motivated federated learning in the first place.
What would settle it
Re-training the CIFAR-10 models with exactly 0% versus 5% shared data and checking whether the accuracy difference reaches approximately 30%.
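One way to score such a re-run is sketched below. It assumes test accuracies are collected per random seed as fractions in [0, 1], and it reports the gain both in absolute percentage points and relative to the 0% baseline, since the two readings of "30%" differ. The function names and the tolerance are illustrative choices, not part of the paper.

```python
from statistics import mean


def accuracy_gain(acc_no_shared, acc_with_shared):
    """Compare per-seed test accuracies (fractions in [0, 1]) from two settings.

    Returns the gain in absolute percentage points and as a percentage
    relative to the no-shared-data baseline.
    """
    base = mean(acc_no_shared)
    treated = mean(acc_with_shared)
    absolute_points = 100.0 * (treated - base)
    relative_percent = 100.0 * (treated - base) / base
    return absolute_points, relative_percent


def reaches_target(gain, target=30.0, tolerance=3.0):
    """True if the gain is at least `target` minus an (arbitrary) tolerance."""
    return gain >= target - tolerance
```

A call would pass the measured accuracy lists for the 0% and 5% runs, then apply `reaches_target` to whichever of the two gains matches the intended reading of the claim.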
Original abstract
Federated learning enables resource-constrained edge compute devices, such as mobile phones and IoT devices, to learn a shared model for prediction, while keeping the training data local. This decentralized approach to train models provides privacy, security, regulatory and economic benefits. In this work, we focus on the statistical challenge of federated learning when local data is non-IID. We first show that the accuracy of federated learning reduces significantly, by up to 55% for neural networks trained for highly skewed non-IID data, where each client device trains only on a single class of data. We further show that this accuracy reduction can be explained by the weight divergence, which can be quantified by the earth mover's distance (EMD) between the distribution over classes on each device and the population distribution. As a solution, we propose a strategy to improve training on non-IID data by creating a small subset of data which is globally shared between all the edge devices. Experiments show that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.
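For readers unfamiliar with how a shared model is learned while data stays local, the aggregation step of federated averaging can be sketched as a weighted mean of client weight vectors, with each client weighted by its number of local examples. The snippet is a minimal sketch using plain Python lists; the two clients and their values are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighting each by its local example count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


# Two hypothetical clients with 2-parameter models and 100 / 300 local examples.
global_weights = federated_average([[1.0, 0.0], [3.0, 4.0]], [100, 300])
# global_weights == [2.5, 3.0]
```

Only the weight vectors and example counts reach the server in this step; the training examples themselves never leave the clients.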
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that non-IID data distributions in federated learning cause large accuracy drops (up to 55% for neural networks when each client holds data from only one class), that this degradation can be explained by weight divergence quantified via Earth Mover's Distance (EMD) between local and population class distributions, and that sharing a small (5%) globally representative data subset across clients recovers up to 30% accuracy on CIFAR-10.
Significance. If the mitigation holds under realistic constraints, the work supplies useful early empirical baselines on the severity of non-IID effects in federated learning and a simple heuristic for partial recovery. The EMD diagnostic is interpretable and the reported gains on standard vision benchmarks are practically relevant, though the privacy feasibility of the shared-subset construction remains the central open question.
major comments (1)
- Abstract and proposed strategy: the reported 30% accuracy gain on CIFAR-10 is obtained only after a 5% globally shared subset has been sampled from the full centralized training distribution and broadcast to every client. No procedure is given for constructing an equivalent representative subset when data never leaves the clients, directly contradicting the privacy premise stated in the introduction.
minor comments (1)
- Experiments section: the abstract states clear empirical drops and recoveries but provides no error bars, number of random seeds, or ablation details on the EMD-weight-divergence causal link.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting an important practical limitation in our proposed strategy. We address the concern directly below and will revise the manuscript to clarify the assumptions and limitations.
-
Referee: Abstract and proposed strategy: the reported 30% accuracy gain on CIFAR-10 is obtained only after a 5% globally shared subset has been sampled from the full centralized training distribution and broadcast to every client. No procedure is given for constructing an equivalent representative subset when data never leaves the clients, directly contradicting the privacy premise stated in the introduction.
Authors: We agree that the experimental construction of the 5% shared subset relies on sampling from the full centralized training distribution, which assumes a form of global access not available under strict per-client data isolation. This is a genuine limitation of the current presentation. In the revised manuscript we will (1) explicitly qualify the abstract and introduction to state that the shared subset is created from a centralized view in our experiments, (2) add a dedicated limitations paragraph discussing how the subset could be obtained in practice (e.g., via a small public dataset drawn from a similar distribution, synthetic data, or a trusted curator), and (3) note that the approach therefore represents a practical heuristic that relaxes the strictest privacy model rather than a fully decentralized solution. These changes will remove any implication of contradiction with the privacy premise.
Revision: yes
Circularity Check
No significant circularity; empirical results independent of inputs
The paper's core claims rest on experimental measurements of accuracy under non-IID partitions and the effect of adding a shared data subset. No derivation chain, equations, or 'predictions' are presented that reduce by construction to fitted parameters or self-referential definitions. The weight divergence explanation invokes EMD as a standard metric between class distributions and the global one, without deriving EMD from the target result. No self-citations are load-bearing for uniqueness or ansatz. The reported accuracy gains are direct measurements conditional on the experimental protocol, not tautological outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- shared_data_fraction
axioms (1)
- domain assumption: Local data distributions are fixed and known for the purpose of computing EMD to the global distribution.
Forward citations
Cited by 31 Pith papers
-
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
-
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
-
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
HeroCrystal uses single-image diffusion synthesis, probabilistic federated Faster R-CNN with contrastive debiasing, and inconsistent-category integration to reach 33.4% mAP in privacy-preserving multi-camera object detection.
-
Byzantine-Robust Distributed SGD: A Unified Analysis and Tight Error Bounds
Unified convergence rates and tight lower bounds for Byzantine-robust distributed SGD under stochasticity and general data heterogeneity, showing local momentum reduces stochastic error floors.
-
On the Fragility of Data Attribution When Learning Is Distributed
A single adversary in distributed training inflates its attribution value via latent optimization on synthetic batches without degrading accuracy or triggering basic defenses.
-
FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning
FedVSSAM mitigates flatness incompatibility in SAM-based federated learning by consistently using a variance-suppressed adjusted direction for local perturbation, descent, and global updates, with non-convex convergen...
-
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
-
Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
AW-PSP dynamically weights node sampling by real-time availability predictions and failure correlations to improve robustness, label coverage, and fairness in federated learning under correlated device failures.
-
HierFedCEA: Hierarchical Federated Edge Learning for Privacy-Preserving Climate Control Optimization Across Heterogeneous Controlled Environment Agriculture Facilities
HierFedCEA delivers a hierarchical federated learning framework for privacy-preserving climate control optimization across heterogeneous CEA facilities, reaching 94% of centralized performance with under 1 MB communication.
-
Client-Conditional Federated Learning via Local Training Data Statistics
Conditioning a global FL model on local PCA statistics of client data matches oracle cluster performance across heterogeneous settings and is robust to sparse data with zero added communication.
-
Practical Quantum Federated Learning for Privacy-Sensitive Healthcare: Communication Efficiency and Noise Resilience
Hybrid QFL cuts quantum transmissions from 3TNMP to {3t + 2(T-t)}NMP over T rounds while preserving near-centralized convergence and improving depolarizing-noise resilience via decentralized aggregation and Steane-code QEC.
-
Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks
Fed-Listing infers client label proportions in FedGNNs from final-layer gradients, outperforming baselines on four datasets and three architectures even in non-i.i.d. settings.
-
Task-agnostic Low-rank Residual Adaptation for Efficient Federated Continual Fine-Tuning
Fed-TaLoRA uses task-agnostic low-rank residual adaptation with post-aggregation calibration to enable efficient federated continual fine-tuning across sequential tasks under non-IID conditions.
-
FIRMA: FIbonacci Ring Model Aggregation for Privacy-preserving Federated Learning
FIRMA introduces Fibonacci ring aggregation protocols for server-free federated learning that maintain private heads and achieve higher accuracy than FedAvg under label skew across multiple benchmarks and heterogeneit...
-
Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning
Proactive client selection in federated learning via differentially private mutual information and simulated annealing to optimize Potential Federation Loss for utility and fairness.
-
Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning
Proposes proactive client selection via differentially private mutual information and Potential Federation Loss optimized by simulated annealing to achieve faster, fairer, and more accurate federated models than unifo...
-
FedSDR: Federated Self-Distillation with Rectification
FedSDR augments federated self-distillation with dual LoRA streams (local smoothing and global rectification) to produce globally aligned, factually faithful models under statistical heterogeneity.
-
FedSurrogate: Backdoor Defense in Federated Learning via Layer Criticality and Surrogate Replacement
FedSurrogate defends federated learning against backdoors by clustering on security-critical layers and substituting malicious updates with benign surrogates, reporting false-positive rates below 10% and attack succes...
-
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
-
CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification
CLAD is a clustered federated learning framework with a dual-mode architecture for joint anomaly detection and attack classification in IoT using labeled and unlabeled data.
-
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
HeroCrystal achieves 33.4% mAP on cross-domain multi-camera object detection by combining one-shot diffusion-based synthetic data generation, probabilistic federated Faster R-CNN, and inconsistent-category distillatio...
-
FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning
FMCL performs one-shot class-aware client clustering in heterogeneous federated learning by deriving semantic signatures from foundation model embeddings and using cosine distance, yielding improved performance and st...
-
REVERB-FL: Server-Side Adversarial and Reserve-Enhanced Federated Learning for Robust Audio Classification
REVERB-FL uses a server-side reserve set with retraining and adversarial training to reduce poisoning effects and speed convergence in federated audio classification under non-IID data.
-
Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification
Non-identical data distributions degrade federated averaging accuracy on visual classification, but server momentum raises CIFAR-10 accuracy from 30.1% to 76.9% in the most skewed regimes.
-
Evaluating Federated Learning approaches for mammography under breast density heterogeneity
FedAvg matches centralized training accuracy on mammography data split by breast density heterogeneity, showing standard FL can handle this clinical variation without special fixes.
-
FedKPer: Tackling Generalization and Personalization in Medical Federated Learning via Knowledge Personalization
FedKPer improves the generalization-personalization trade-off in medical federated learning via local knowledge personalization and selective aggregation that emphasizes reliable updates.
-
Automating aggregation strategy selection in federated learning
A framework automates federated learning aggregation strategy selection via LLM inference in single-trial mode and genetic search in multi-trial mode, improving robustness under non-IID data.
-
Multi-Worker Selection based Distributed Swarm Learning for Edge IoT with Non-i.i.d. Data
Introduces M-DSL algorithm for distributed swarm learning that selects workers using a new non-i.i.d. degree metric to improve convergence and accuracy under data heterogeneity, with theoretical analysis and experimen...
-
Privacy-Preserving Federated Learning: Integrating Zero-Knowledge Proofs in Scalable Distributed Architectures
A hybrid federated learning architecture using zero-knowledge proofs for computation verification retains 94.2% accuracy under adversarial conditions across 1,000 nodes.
-
Split and Aggregation Learning for Foundation Models Over Mobile Embodied AI Network (MEAN): A Comprehensive Survey
The paper surveys split and aggregation learning for foundation models in 6G networks to improve efficiency, resource use, and data privacy in distributed AI.
-
Knowledge Distillation in Federated Learning: a Survey on Long Lasting Challenges and New Solutions
A survey organizing knowledge distillation techniques for addressing privacy, heterogeneity, communication, and personalization challenges in federated learning.