FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging
Pith reviewed 2026-05-19 02:13 UTC · model grok-4.3
The pith
FedMoSWA achieves smaller optimization and generalization errors than FedSAM and variants under high data heterogeneity in federated learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that stochastic weight averaging, when combined with momentum-based control in FedMoSWA, produces both faster convergence and tighter generalization bounds than FedSAM and its variants precisely when data heterogeneity is high, because the averaging step locates flatter minima while the momentum term reduces client-server model drift.
What carries the argument
Momentum-based stochastic controlled weight averaging, which periodically averages model weights with a momentum-adjusted control term to enforce alignment between local and global models while searching for flatter loss surfaces.
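The mechanism described above can be sketched in a few lines. This is a minimal illustration and not the paper's Algorithm 1: the function names, the SCAFFOLD-style form of the corrected step, and the exponential-moving-average form of the momentum control are all assumptions made for this example.

```python
from typing import List

Vector = List[float]


def swa_update(avg: Vector, new: Vector, n_averaged: int) -> Vector:
    """Running mean of weight snapshots: avg <- (n * avg + new) / (n + 1)."""
    return [(n_averaged * a + w) / (n_averaged + 1) for a, w in zip(avg, new)]


def momentum_control(control: Vector, client_delta: Vector, beta: float) -> Vector:
    """Hypothetical control update: exponential moving average of client update directions."""
    return [beta * c + (1.0 - beta) * d for c, d in zip(control, client_delta)]


def corrected_local_step(w: Vector, grad: Vector, local_control: Vector,
                         global_control: Vector, lr: float) -> Vector:
    """Assumed SCAFFOLD-style step: remove the client's own control estimate, add the global one."""
    return [wi - lr * (g - cl + cg)
            for wi, g, cl, cg in zip(w, grad, local_control, global_control)]


# Usage: average three weight snapshots collected along a training trajectory.
snapshots = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = snapshots[0]
for n, snap in enumerate(snapshots[1:], start=1):
    avg = swa_update(avg, snap, n)
```

The running-mean form avoids storing every snapshot, which matters when the averaged object is a full model held on the server.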
If this is right
- Convergence analysis guarantees that FedMoSWA reaches a stationary point with smaller error than FedSAM after the same number of communication rounds.
- Generalization bounds for FedMoSWA are strictly tighter than those derived for FedSAM when heterogeneity is large.
- The same averaging-plus-momentum mechanism can be inserted into other sharpness-aware federated optimizers without changing their local update rules.
- Empirical superiority on CIFAR-10/100 and Tiny ImageNet extends to any image classification task where client data partitions exhibit strong label or feature skew.
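Label skew of the kind mentioned in the last point is commonly simulated with a Dirichlet partition, where a smaller concentration parameter gives more skewed clients. The sketch below assumes that scheme; this page does not state how the paper builds its partitions, and `dirichlet_label_skew`, `alpha`, and the toy labels are hypothetical names chosen for illustration.

```python
import random
from typing import Dict, List


def dirichlet_label_skew(labels: List[int], num_clients: int, alpha: float,
                         seed: int = 0) -> Dict[int, List[int]]:
    """Split sample indices across clients, drawing per-class client shares from Dirichlet(alpha)."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients: Dict[int, List[int]] = {c: [] for c in range(num_clients)}
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized to sum to 1.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        if total <= 0.0:
            # Guard against numerical underflow for very small alpha.
            shares = [1.0 / num_clients] * num_clients
        else:
            shares = [g / total for g in gammas]
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += shares[c]
            # The last client takes whatever remains so every index is assigned exactly once.
            end = len(idxs) if c == num_clients - 1 else min(len(idxs), round(cum * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Usage: 1000 toy samples over 10 classes, split across 5 clients with strong skew.
labels = [i % 10 for i in range(1000)]
parts = dirichlet_label_skew(labels, num_clients=5, alpha=0.1)
```

Raising `alpha` (for example to 10.0) moves the split toward an even, near-i.i.d. partition, which gives a simple knob for sweeping heterogeneity.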
Where Pith is reading between the lines
- If the momentum control term proves robust, it could be combined with other client drift mitigation techniques such as adaptive optimizers to further widen the performance gap.
- The flat-minima hypothesis suggests testing whether FedMoSWA also improves robustness to adversarial perturbations on the same heterogeneous splits.
- The proof technique for smaller errors may generalize to other averaging-based federated methods, offering a template for comparing any two FL algorithms under a shared heterogeneity measure.
Load-bearing premise
Stochastic weight averaging will locate flatter minima that improve generalization even when client data distributions are highly non-identical.
What would settle it
A controlled experiment in which FedMoSWA is run on the same high-heterogeneity partition used in the paper and the measured sharpness of the final minimum is not lower than that of FedSAM, or the test error is not smaller.
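One way to make "measured sharpness" concrete is a perturbation-based proxy: the largest loss increase found within a small radius around the final weights. The sketch below is an assumption about how such a check could be run, not the measure used in the paper; `sharpness_proxy`, `rho`, and the two toy quadratic losses are hypothetical.

```python
import math
import random
from typing import Callable, List


def sharpness_proxy(loss: Callable[[List[float]], float], w: List[float],
                    rho: float = 0.05, num_directions: int = 64, seed: int = 0) -> float:
    """Largest loss increase over random perturbations of norm rho around w."""
    rng = random.Random(seed)
    base = loss(w)
    worst = 0.0
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in w]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        perturbed = [wi + rho * d / norm for wi, d in zip(w, direction)]
        worst = max(worst, loss(perturbed) - base)
    return worst


# Toy losses with a minimum at the origin: one steep bowl, one shallow bowl.
def sharp_loss(w: List[float]) -> float:
    return 0.5 * sum(10.0 * wi * wi for wi in w)


def flat_loss(w: List[float]) -> float:
    return 0.5 * sum(1.0 * wi * wi for wi in w)


sharp_value = sharpness_proxy(sharp_loss, [0.0, 0.0])
flat_value = sharpness_proxy(flat_loss, [0.0, 0.0])
```

Random sampling only gives a lower bound on the worst-case increase; a gradient-ascent step, as in sharpness-aware minimization, would tighten it at the cost of an extra backward pass.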
Original abstract
For federated learning (FL) algorithms such as FedSAM, their generalization capability is crucial for real-word applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called \texttt{FedSWA}), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (\texttt{FedMoSWA}), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for \texttt{FedSWA} and \texttt{FedMoSWA}. We also prove that the optimization and generalization errors of \texttt{FedMoSWA} are smaller than those of their counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts. Open source code at: https://github.com/junkangLiu0/FedSWA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FedSWA, which applies stochastic weight averaging to locate flatter minima and improve generalization in federated learning under high data heterogeneity, and FedMoSWA, which augments this with momentum-based stochastic controlled weight averaging to better align local and global models. It reports that FedSAM underperforms FedAvg in highly heterogeneous regimes, provides convergence analysis and generalization bounds for both algorithms, proves that the optimization and generalization errors of FedMoSWA are smaller than those of FedSAM and its variants, and shows empirical gains on CIFAR-10/100 and Tiny ImageNet.
Significance. If the bounds are tight and the empirical gains hold under proper controls, the work would offer a practical route to better generalization in non-i.i.d. federated settings by combining weight averaging with momentum control. The open-source code and explicit comparison to FedSAM variants are strengths that could aid reproducibility and adoption.
major comments (2)
- [§4] §4 (Convergence Analysis): The claimed superiority of FedMoSWA's convergence rate over FedSAM variants rests on standard smoothness and bounded-gradient assumptions; these are precisely the conditions strained by the high-heterogeneity regime that motivates the paper, yet the derivation does not quantify how the momentum term keeps client drift bounded as heterogeneity increases.
- [§5] §5 (Generalization Bounds): The proof that FedMoSWA yields strictly smaller generalization error than FedSAM (and variants) invokes the premise that stochastic weight averaging reliably reaches flatter minima; this premise is motivated by the empirical observation that FedSAM underperforms FedAvg, but the bound comparison step does not derive an explicit dependence on the heterogeneity parameter that would confirm the error reduction holds in the limit.
minor comments (2)
- [Abstract] Abstract: 'real-word' should read 'real-world'.
- [Algorithm 1] Algorithm 1 (FedMoSWA): the update rule for the controlled momentum term is presented without an explicit statement of the hyper-parameter range that guarantees stability under the heterogeneity levels tested in the experiments.
Simulated Author's Rebuttal
We thank the referee for the valuable comments and suggestions. We address the major comments point by point below, indicating where revisions will be made to the manuscript.
- Referee: [§4] §4 (Convergence Analysis): The claimed superiority of FedMoSWA's convergence rate over FedSAM variants rests on standard smoothness and bounded-gradient assumptions; these are precisely the conditions strained by the high-heterogeneity regime that motivates the paper, yet the derivation does not quantify how the momentum term keeps client drift bounded as heterogeneity increases.
  Authors: We acknowledge that the convergence analysis relies on standard assumptions of smoothness and bounded gradients, which can be affected by high data heterogeneity. The momentum term in FedMoSWA is specifically designed to better align the local models with the global model, thereby controlling client drift. Although the current derivation demonstrates improved convergence rates, it does not provide an explicit bound on the drift in terms of the heterogeneity parameter. In the revised version, we will add a discussion and possibly a supporting lemma to illustrate how the momentum-based stochastic controlled weight averaging helps keep the client drift bounded as heterogeneity increases. This constitutes a partial revision.
  Revision: partial
- Referee: [§5] §5 (Generalization Bounds): The proof that FedMoSWA yields strictly smaller generalization error than FedSAM (and variants) invokes the premise that stochastic weight averaging reliably reaches flatter minima; this premise is motivated by the empirical observation that FedSAM underperforms FedAvg, but the bound comparison step does not derive an explicit dependence on the heterogeneity parameter that would confirm the error reduction holds in the limit.
  Authors: The referee points out a valid aspect of our generalization analysis. The comparison of generalization errors assumes that SWA leads to flatter minima, which is empirically justified by the observation that FedSAM performs worse than FedAvg in highly heterogeneous settings. While we prove that FedMoSWA has smaller optimization and generalization errors than the counterparts, the bound does not explicitly show the dependence on the heterogeneity parameter. We will revise the generalization bounds section to include an explicit discussion or derivation step that relates the error reduction to the level of heterogeneity, confirming the advantage in the high-heterogeneity limit. This will be incorporated in the next manuscript version.
  Revision: yes
Circularity Check
No significant circularity; theoretical bounds derived from standard FL assumptions
The paper derives convergence rates and generalization bounds for FedSWA and FedMoSWA, then compares optimization and generalization errors to FedSAM variants. These steps rely on conventional smoothness, bounded-gradient, and heterogeneity assumptions typical in federated optimization literature rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The motivating observation that FedSAM underperforms FedAvg under high heterogeneity is empirical and external to the proof; the claimed error reductions follow from the stated lemmas without reducing by construction to quantities defined only inside the paper's own fitted values or prior self-citations. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions on gradient boundedness, smoothness, and data heterogeneity level, used for the convergence and generalization analysis in federated learning.
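For orientation, assumptions of this kind are commonly written in the federated optimization literature as below. The exact statements used in the paper are not reproduced on this page, so the symbols here ($f_i$ for the local objective of client $i$, $f$ for their average over $N$ clients, and constants $L$, $G$, $\sigma_g$) are assumptions made for illustration.

```latex
% L-smoothness of each local objective
\|\nabla f_i(x) - \nabla f_i(y)\| \le L \,\|x - y\| \quad \text{for all } x, y

% Bounded gradients
\|\nabla f_i(x)\| \le G \quad \text{for all } x

% Bounded heterogeneity (gradient dissimilarity across clients)
\frac{1}{N} \sum_{i=1}^{N} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma_g^2 \quad \text{for all } x
```

Under this reading, the referee's request for an explicit dependence on the heterogeneity parameter amounts to asking how the bounds scale with a quantity such as $\sigma_g$.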