FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging

Fanhua Shang; Hongying Liu; Jin Liu; Liu junkang; Wei Feng; Yuanyuan Liu

arxiv: 2507.20016 · v1 · submitted 2025-07-26 · 💻 cs.LG · cs.AI

Liu junkang, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, Wei Feng

Pith reviewed 2026-05-19 02:13 UTC · model grok-4.3

keywords federated learning · generalization bounds · stochastic weight averaging · data heterogeneity · FedMoSWA · momentum control · FedSAM · non-IID data

The pith

FedMoSWA achieves smaller optimization and generalization errors than FedSAM and variants under high data heterogeneity in federated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the observation that FedSAM underperforms FedAvg when client data distributions differ sharply. It introduces FedSWA to apply stochastic weight averaging for flatter minima in this regime and then develops FedMoSWA, which adds momentum-based control to keep local updates aligned with the global model. Theoretical analysis supplies convergence rates and generalization bounds, together with a direct proof that FedMoSWA incurs strictly smaller optimization and generalization errors than FedSAM and related methods. Experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show consistent gains. If the bounds hold, federated training could become more reliable on the non-IID datasets that dominate practical deployments.

Core claim

The central claim is that stochastic weight averaging, when combined with momentum-based control in FedMoSWA, produces both faster convergence and tighter generalization bounds than FedSAM and its variants precisely when data heterogeneity is high, because the averaging step locates flatter minima while the momentum term reduces client-server model drift.

What carries the argument

Momentum-based stochastic controlled weight averaging, which periodically averages model weights with a momentum-adjusted control term to enforce alignment between local and global models while searching for flatter loss surfaces.
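The three ingredients named above can be sketched separately. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the function names `swa_update`, `momentum_control`, and `controlled_step`, the exponential-moving-average form of the control term, and the `beta` and `lr` values are all hypothetical choices made for the example.

```python
def swa_update(avg, weights, n_averaged):
    """Fold one more weight vector into a running mean of n_averaged vectors."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, weights)]


def momentum_control(control, client_grads, beta=0.9):
    """Exponential moving average of the mean client gradient (assumed form)."""
    mean_grad = [sum(col) / len(client_grads) for col in zip(*client_grads)]
    return [beta * c + (1.0 - beta) * g for c, g in zip(control, mean_grad)]


def controlled_step(weights, grad, local_control, global_control, lr=0.1):
    """One local step whose gradient is shifted toward the global direction."""
    return [w - lr * (g - lc + gc)
            for w, g, lc, gc in zip(weights, grad, local_control, global_control)]


# Running average over three global iterates.
avg, count = [0.0, 0.0], 0
for iterate in ([1.0, 3.0], [3.0, 5.0], [5.0, 7.0]):
    avg = swa_update(avg, iterate, count)
    count += 1

# Two clients pulling in opposite directions on the first coordinate.
control = momentum_control([0.0, 0.0], [[2.0, 1.0], [-2.0, 1.0]], beta=0.5)

# The first client's own gradient is cancelled by its local control, so only
# the global control term moves the weights.
stepped = controlled_step([1.0, 1.0], [2.0, 1.0], [2.0, 1.0], control, lr=0.1)
```

In this toy case the opposing first-coordinate gradients cancel in the control term, so the corrected step ignores the client-specific pull and follows only the direction the clients share.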

If this is right

Convergence analysis guarantees that FedMoSWA reaches a stationary point with smaller error than FedSAM after the same number of communication rounds.
Generalization bounds for FedMoSWA are strictly tighter than those derived for FedSAM when heterogeneity is large.
The same averaging-plus-momentum mechanism can be inserted into other sharpness-aware federated optimizers without changing their local update rules.
Empirical superiority on CIFAR-10/100 and Tiny ImageNet would be expected to carry over to other image classification tasks where client data partitions exhibit strong label or feature skew.
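Label skew of the kind mentioned in the last item is often simulated by splitting each class across clients with Dirichlet-distributed proportions. The sketch below assumes that scheme; the page does not say which partitioning the paper uses, and `partition_by_label`, `alpha`, and the toy label list are hypothetical.

```python
import random


def dirichlet(alpha, k, rng):
    """Sample k proportions from a symmetric Dirichlet via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws) or 1.0  # guard against an all-zero draw
    return [d / total for d in draws]


def partition_by_label(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients; smaller alpha gives stronger label skew."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = [[] for _ in range(n_clients)]
    for _, idxs in sorted(by_class.items()):
        rng.shuffle(idxs)
        props = dirichlet(alpha, n_clients, rng)
        start, cum = 0, 0.0
        for c in range(n_clients):
            cum += props[c]
            # The last client takes whatever remains so no index is dropped.
            end = len(idxs) if c == n_clients - 1 else min(len(idxs), round(cum * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]  # toy stand-in for a 10-class dataset
clients = partition_by_label(labels, n_clients=5, alpha=0.1)
```

Every index lands on exactly one client; lowering `alpha` concentrates each class on fewer clients, which is one way to dial heterogeneity up for a comparison like the one described here.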

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the momentum control term proves robust, it could be combined with other client drift mitigation techniques such as adaptive optimizers to further widen the performance gap.
The flat-minima hypothesis suggests testing whether FedMoSWA also improves robustness to adversarial perturbations on the same heterogeneous splits.
The proof technique for smaller errors may generalize to other averaging-based federated methods, offering a template for comparing any two FL algorithms under a shared heterogeneity measure.

Load-bearing premise

Stochastic weight averaging will locate flatter minima that improve generalization even when client data distributions are highly non-identical.

What would settle it

A controlled experiment in which FedMoSWA is run on the same high-heterogeneity partition used in the paper and the measured sharpness of the final minimum is not lower than that of FedSAM, or the test error is not smaller.
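One way to make "measured sharpness" concrete is a random-search proxy: the largest loss increase found on a small sphere around the final weights. This is an assumed metric chosen for illustration, since the page does not say how the paper measures sharpness (Hessian-based measures are a common alternative); `sharpness`, `rho`, and the two toy quadratic losses are hypothetical.

```python
import math
import random


def sharpness(loss_fn, w, rho=0.05, n_samples=64, seed=0):
    """Largest loss increase over random perturbations of norm rho around w."""
    rng = random.Random(seed)
    base = loss_fn(w)
    worst = 0.0
    for _ in range(n_samples):
        direction = [rng.gauss(0.0, 1.0) for _ in w]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        perturbed = [wi + rho * d / norm for wi, d in zip(w, direction)]
        worst = max(worst, loss_fn(perturbed) - base)
    return worst


def sharp_loss(w):
    return 5.0 * sum(x * x for x in w)   # steep bowl


def flat_loss(w):
    return 0.05 * sum(x * x for x in w)  # shallow bowl


sharp_value = sharpness(sharp_loss, [0.0, 0.0])
flat_value = sharpness(flat_loss, [0.0, 0.0])
```

Applied to the final FedMoSWA and FedSAM models on the same partition, a proxy like this would give the two numbers the proposed experiment compares, alongside test error.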

Original abstract

For federated learning (FL) algorithms such as FedSAM, their generalization capability is crucial for real-word applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called FedSWA), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (FedMoSWA), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for FedSWA and FedMoSWA. We also prove that the optimization and generalization errors of FedMoSWA are smaller than those of their counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts. Open source code at: https://github.com/junkangLiu0/FedSWA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FedSWA and FedMoSWA add SWA plus momentum control to FedSAM-style methods and show better empirical results plus bounds under high heterogeneity, but the advantage rests on assumptions that may weaken exactly when client drift is strongest.

The paper's core move is to notice that FedSAM can underperform FedAvg when data heterogeneity is high, then graft stochastic weight averaging onto the framework to target flatter minima and add a momentum term in FedMoSWA to keep local and global models from drifting apart. They supply convergence analysis and generalization bounds, and they prove that the new method has smaller optimization and generalization error than FedSAM and variants. Experiments on CIFAR-10/100 and Tiny ImageNet back the claim of superiority in the targeted regime, and the code is released.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FedSWA, which applies stochastic weight averaging to locate flatter minima and improve generalization in federated learning under high data heterogeneity, and FedMoSWA, which augments this with momentum-based stochastic controlled weight averaging to better align local and global models. It reports that FedSAM underperforms FedAvg in highly heterogeneous regimes, provides convergence analysis and generalization bounds for both algorithms, proves that the optimization and generalization errors of FedMoSWA are smaller than those of FedSAM and its variants, and shows empirical gains on CIFAR-10/100 and Tiny ImageNet.

Significance. If the bounds are tight and the empirical gains hold under proper controls, the work would offer a practical route to better generalization in non-i.i.d. federated settings by combining weight averaging with momentum control. The open-source code and explicit comparison to FedSAM variants are strengths that could aid reproducibility and adoption.

major comments (2)

[§4] §4 (Convergence Analysis): The claimed superiority of FedMoSWA's convergence rate over FedSAM variants rests on standard smoothness and bounded-gradient assumptions; these are precisely the conditions strained by the high-heterogeneity regime that motivates the paper, yet the derivation does not quantify how the momentum term keeps client drift bounded as heterogeneity increases.
[§5] §5 (Generalization Bounds): The proof that FedMoSWA yields strictly smaller generalization error than FedSAM (and variants) invokes the premise that stochastic weight averaging reliably reaches flatter minima; this premise is motivated by the empirical observation that FedSAM underperforms FedAvg, but the bound comparison step does not derive an explicit dependence on the heterogeneity parameter that would confirm the error reduction holds in the limit.
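The drift the first comment asks the authors to quantify can be seen on a toy problem. The sketch below is not the paper's method: it substitutes a SCAFFOLD-style control variate for FedMoSWA's momentum term, uses one-dimensional quadratic client losses with different curvatures, and assumes full participation; `local_train`, `train`, and all constants are hypothetical.

```python
def local_train(w_start, curvature, center, c_local, c_global, lr, steps):
    """Gradient steps on the toy loss 0.5 * curvature * (w - center)**2."""
    w = w_start
    for _ in range(steps):
        grad = curvature * (w - center)
        w -= lr * (grad - c_local + c_global)  # correction is zero when controls are off
    return w


def train(clients, use_control, rounds=50, lr=0.1, steps=10):
    """clients: list of (curvature, center) pairs, all participating every round."""
    w_global = 0.0
    c_locals = [0.0] * len(clients)
    c_global = 0.0
    for _ in range(rounds):
        local_models = [
            local_train(w_global, curvature, center, c_locals[i], c_global, lr, steps)
            for i, (curvature, center) in enumerate(clients)
        ]
        if use_control:
            # New local control = average raw gradient seen along the local path.
            c_locals = [c_i - c_global + (w_global - w_i) / (lr * steps)
                        for c_i, w_i in zip(c_locals, local_models)]
            c_global = sum(c_locals) / len(c_locals)
        w_global = sum(local_models) / len(local_models)
    return w_global


clients = [(0.5, -1.0), (2.0, 1.0)]  # heterogeneous curvature and optimum
w_star = sum(a * c for a, c in clients) / sum(a for a, _ in clients)  # exact minimizer
bias_plain = abs(train(clients, use_control=False) - w_star)
bias_controlled = abs(train(clients, use_control=True) - w_star)
```

Plain averaging settles at a point biased toward the high-curvature client, while the corrected run reaches the true minimizer. An explicit bound of the kind the referee requests would state how such a bias scales with the heterogeneity level.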

minor comments (2)

[Abstract] Abstract: 'real-word' should read 'real-world'.
[Algorithm 1] Algorithm 1 (FedMoSWA): the update rule for the controlled momentum term is presented without an explicit statement of the hyper-parameter range that guarantees stability under the heterogeneity levels tested in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the valuable comments and suggestions. We address the major comments point by point below, indicating where revisions will be made to the manuscript.

Referee: [§4] §4 (Convergence Analysis): The claimed superiority of FedMoSWA's convergence rate over FedSAM variants rests on standard smoothness and bounded-gradient assumptions; these are precisely the conditions strained by the high-heterogeneity regime that motivates the paper, yet the derivation does not quantify how the momentum term keeps client drift bounded as heterogeneity increases.

Authors: We acknowledge that the convergence analysis relies on standard assumptions of smoothness and bounded gradients, which can be affected by high data heterogeneity. The momentum term in FedMoSWA is specifically designed to better align the local models with the global model, thereby controlling client drift. Although the current derivation demonstrates improved convergence rates, it does not provide an explicit bound on the drift in terms of the heterogeneity parameter. In the revised version, we will add a discussion and possibly a supporting lemma to illustrate how the momentum-based stochastic controlled weight averaging helps keep the client drift bounded as heterogeneity increases. This constitutes a partial revision.
Revision: partial

Referee: [§5] §5 (Generalization Bounds): The proof that FedMoSWA yields strictly smaller generalization error than FedSAM (and variants) invokes the premise that stochastic weight averaging reliably reaches flatter minima; this premise is motivated by the empirical observation that FedSAM underperforms FedAvg, but the bound comparison step does not derive an explicit dependence on the heterogeneity parameter that would confirm the error reduction holds in the limit.

Authors: The referee points out a valid aspect of our generalization analysis. The comparison of generalization errors assumes that SWA leads to flatter minima, which is empirically justified by the observation that FedSAM performs worse than FedAvg in highly heterogeneous settings. While we prove that FedMoSWA has smaller optimization and generalization errors than the counterparts, the bound does not explicitly show the dependence on the heterogeneity parameter. We will revise the generalization bounds section to include an explicit discussion or derivation step that relates the error reduction to the level of heterogeneity, confirming the advantage in the high-heterogeneity limit. This will be incorporated in the next manuscript version.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; theoretical bounds derived from standard FL assumptions

The paper derives convergence rates and generalization bounds for FedSWA and FedMoSWA, then compares optimization and generalization errors to FedSAM variants. These steps rely on conventional smoothness, bounded-gradient, and heterogeneity assumptions typical in federated optimization literature rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The motivating observation that FedSAM underperforms FedAvg under high heterogeneity is empirical and external to the proof; the claimed error reductions follow from the stated lemmas without reducing by construction to quantities defined only inside the paper's own fitted values or prior self-citations. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard smoothness and bounded-gradient assumptions common to federated optimization papers plus the new algorithmic construction; no new physical entities are introduced.

axioms (1)

Domain assumption: standard assumptions on gradient boundedness, smoothness, and data heterogeneity level for convergence and generalization analysis in federated learning. Invoked to derive the stated convergence and error bounds.
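For orientation, assumptions of this kind are usually written along the following lines in the federated optimization literature. These are typical forms only, and the paper's exact statements may differ; the symbols $L$, $G$, and $\zeta$, and the notation $f = \frac{1}{N}\sum_{i=1}^{N} f_i$ for the average of the $N$ client objectives, are assumed here.

```latex
% Typical forms only; L, G, \zeta and the client count N are assumed symbols.
\begin{align*}
\|\nabla f_i(x) - \nabla f_i(y)\| &\le L\,\|x - y\|
  && \text{(smoothness)} \\
\|\nabla f_i(x)\| &\le G
  && \text{(bounded gradients)} \\
\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(x) - \nabla f(x)\|^2 &\le \zeta^2
  && \text{(bounded heterogeneity)}
\end{align*}
```

In this notation, the referee's major comments amount to asking how the claimed error gap behaves as $\zeta$ grows.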

pith-pipeline@v0.9.0 · 5769 in / 1321 out tokens · 58987 ms · 2026-05-19T02:13:53.949673+00:00 · methodology
