pith. machine review for the scientific record.

arxiv: 2510.27486 · v3 · submitted 2025-10-31 · 💻 cs.LG · cs.AI

Recognition: unknown

FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

Authors on Pith: no claims yet
classification 💻 cs.LG cs.AI
keywords FedAdamW, local AdamW, large models, convergence, effectiveness
Original abstract

AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) reinitializing the moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear-speedup convergence rate of $\mathcal{O}(\sqrt{(L \Delta \sigma_l^2)/(S K R \epsilon^2)}+(L \Delta)/R)$ without a \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ a PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available at https://github.com/junkangLiu0/FedAdamW.
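The abstract describes the round structure (local AdamW with a correction term and decoupled weight decay, server-side averaging of parameter deltas and of the mean of $\boldsymbol{v}$) but not the implementation. Below is a minimal, hypothetical Python/NumPy sketch of such a round, assuming an additive correction term, $K$ local steps per client, and that only the scalar mean of $\boldsymbol{v}$ is communicated; the function names `local_adamw_step` and `fedadamw_round` are illustrative and are not taken from the authors' repository.

```python
# Hypothetical sketch of a FedAdamW-style round as described in the abstract.
# Names and details are assumptions; see https://github.com/junkangLiu0/FedAdamW
# for the authors' actual implementation.
import numpy as np


def local_adamw_step(w, grad, m, v, correction, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=1e-2):
    """One AdamW step (bias correction omitted for brevity) with an assumed
    additive correction term pushing the local update toward the global one."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    update = m / (np.sqrt(v) + eps) + correction   # local correction (placeholder form)
    w = w - lr * update - lr * weight_decay * w    # decoupled weight decay (AdamW style)
    return w, m, v


def fedadamw_round(w_global, v_mean, client_grads, K=10, lr=1e-3):
    """One communication round: each client runs K local AdamW steps; the server
    averages the parameter deltas and the scalar mean of the second-moment
    estimates, which is used to reinitialize v in the next round."""
    deltas, v_means = [], []
    for grads in client_grads:                 # grads: list of K stochastic gradients
        w = w_global.copy()
        m = np.zeros_like(w)
        v = np.full_like(w, v_mean)            # reinitialize v from the aggregated mean
        correction = np.zeros_like(w)          # placeholder: the paper's mechanism
                                               # uses the global update direction
        for k in range(K):
            w, m, v = local_adamw_step(w, grads[k], m, v, correction, lr=lr)
        deltas.append(w - w_global)
        v_means.append(v.mean())               # only the mean of v is sent back
    w_global = w_global + np.mean(deltas, axis=0)
    v_mean = float(np.mean(v_means))
    return w_global, v_mean
```

A driver loop would call `fedadamw_round` once per communication round, carrying `w_global` and `v_mean` across rounds, so local second-moment states need not be stored or re-estimated from scratch.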

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

  2. FedBCD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

    cs.LG 2026-03 unverdicted novelty 7.0

    FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.

  3. DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

    cs.LG 2026-02 unverdicted novelty 7.0

    DP-FedAdamW delivers an unbiased second-moment estimator for AdamW in DPFL, proving linear convergence acceleration without heterogeneity assumptions and outperforming SOTA by 5.83% on Tiny-ImageNet with Swin-Base at ε=1.

  4. From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

    cs.LG 2026-04 unverdicted novelty 6.0

    FEAT mitigates representation collapse and prediction bias in federated continual learning by aligning feature angular similarities to shared Equiangular Tight Frame prototypes and removing task-irrelevant directional...

  5. Personalized Federated Learning for Gradient Alignment

    cs.LG 2026-05 unverdicted novelty 5.0

    pFLAlign uses two gradient alignment mechanisms derived from PAC-Bayesian analysis to reduce variance in local training and distortion in aggregation, yielding state-of-the-art personalization in federated learning.

  6. FedNSAM: Consistency of Local and Global Flatness for Federated Learning

    cs.LG 2026-02 unverdicted novelty 4.0

    FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.