pith. machine review for the scientific record.

arxiv: 2510.27486 · v3 · submitted 2025-10-31 · 💻 cs.LG · cs.AI

Recognition: unknown

FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

Authors on Pith: no claims yet
classification 💻 cs.LG cs.AI
keywords FedAdamW, local AdamW, large models, convergence, effectiveness
Original abstract

AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) reinitializing the moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear-speedup convergence rate of $\mathcal{O}(\sqrt{(L \Delta \sigma_l^2)/(S K R \epsilon^2)}+(L \Delta)/R)$ without a \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ a PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available at https://github.com/junkangLiu0/FedAdamW.
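The abstract describes the round structure (local AdamW with a correction term and decoupled weight decay, server-side averaging of parameter deltas and of the mean of $\boldsymbol{v}$) but not the implementation. Below is a minimal, hypothetical Python/NumPy sketch of such a round, assuming an additive correction term, $K$ local steps per client, and that only the scalar mean of $\boldsymbol{v}$ is communicated; the function names `local_adamw_step` and `fedadamw_round` are illustrative and are not taken from the authors' repository.

```python
# Hypothetical sketch of a FedAdamW-style round as described in the abstract.
# Names and details are assumptions; see https://github.com/junkangLiu0/FedAdamW
# for the authors' actual implementation.
import numpy as np


def local_adamw_step(w, grad, m, v, correction, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=1e-2):
    """One AdamW step (bias correction omitted for brevity) with an assumed
    additive correction term pushing the local update toward the global one."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    update = m / (np.sqrt(v) + eps) + correction   # local correction (placeholder form)
    w = w - lr * update - lr * weight_decay * w    # decoupled weight decay (AdamW style)
    return w, m, v


def fedadamw_round(w_global, v_mean, client_grads, K=10, lr=1e-3):
    """One communication round: each client runs K local AdamW steps; the server
    averages the parameter deltas and the scalar mean of the second-moment
    estimates, which is used to reinitialize v in the next round."""
    deltas, v_means = [], []
    for grads in client_grads:                 # grads: list of K stochastic gradients
        w = w_global.copy()
        m = np.zeros_like(w)
        v = np.full_like(w, v_mean)            # reinitialize v from the aggregated mean
        correction = np.zeros_like(w)          # placeholder: the paper's mechanism
                                               # uses the global update direction
        for k in range(K):
            w, m, v = local_adamw_step(w, grads[k], m, v, correction, lr=lr)
        deltas.append(w - w_global)
        v_means.append(v.mean())               # only the mean of v is sent back
    w_global = w_global + np.mean(deltas, axis=0)
    v_mean = float(np.mean(v_means))
    return w_global, v_mean
```

A driver loop would call `fedadamw_round` once per communication round, carrying `w_global` and `v_mean` across rounds, so local second-moment states need not be stored or re-estimated from scratch.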

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

  2. FedBCD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

    cs.LG 2026-03 unverdicted novelty 7.0

    FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.

  3. DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

    cs.LG 2026-02 unverdicted novelty 7.0

    DP-FedAdamW delivers an unbiased second-moment estimator for AdamW in DPFL, proving linear convergence acceleration without heterogeneity assumptions and outperforming SOTA by 5.83% on Tiny-ImageNet with Swin-Base at ε=1.

  4. From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

    cs.LG 2026-04 unverdicted novelty 6.0

    FEAT mitigates representation collapse and prediction bias in federated continual learning by aligning feature angular similarities to shared Equiangular Tight Frame prototypes and removing task-irrelevant directional...

  5. Personalized Federated Learning for Gradient Alignment

    cs.LG 2026-05 unverdicted novelty 5.0

    pFLAlign uses two gradient alignment mechanisms derived from PAC-Bayesian analysis to reduce variance in local training and distortion in aggregation, yielding state-of-the-art personalization in federated learning.

  6. FedNSAM: Consistency of Local and Global Flatness for Federated Learning

    cs.LG 2026-02 unverdicted novelty 4.0

    FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.