Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization
Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3
The pith
A reverse-direction personalized initialization in federated learning mitigates client drift, whose negative impact falls mainly on generalization error rather than optimization error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the stage-wise personalized relaxed initialization in FedInit, which moves each local state away from the current global state in the reverse direction of the latest local state, alleviates the negative impact of client drift primarily by tightening the generalization error bound. The excess risk analysis shows that optimization error remains insensitive to this local inconsistency, while the introduced divergence term isolates its effect on the test error.
What carries the argument
The personalized relaxed initialization mechanism, which sets the local starting point by moving away from the global model in the reverse direction of the latest local state, carries the argument: it supplies both a practical drift-mitigation step and a divergence term in the excess risk bound that separates generalization effects from optimization effects.
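To make the mechanism concrete, here is a minimal sketch of how such a reverse-direction initialization could look, assuming an update of the form x_init = w_global + beta * (w_global - x_local_last); the coefficient name beta, the function name, and the usage comments are illustrative choices, not taken from the paper.

```python
import numpy as np

def relaxed_init(w_global: np.ndarray, x_local_last: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """Reverse-direction personalized initialization (illustrative form, assumed here).

    Start the next local round not at the global model itself but slightly past it,
    in the direction opposite to the client's latest local state:
        x_init = w_global + beta * (w_global - x_local_last)
    With beta = 0 this reduces to the standard FedAvg initialization.
    """
    return w_global + beta * (w_global - x_local_last)

# One communication round, per client (sketch):
#   x0 = relaxed_init(w_global, x_last[i], beta=0.1)
#   x_last[i] = local_sgd(x0, client_data[i], steps=K)   # any local trainer
# followed by global aggregation, e.g. w_global = mean of the updated x_last.
```

Note that the extra step amounts to one vector addition per client per round, which is consistent with the zero-overhead claim above.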
If this is right
- FedInit achieves comparable accuracy to several advanced benchmarks without any added training or communication costs.
- The stage-wise initialization can be plugged into other current federated learning algorithms to raise their generalization performance on heterogeneous data (see the sketch after this list).
- Optimization error in federated learning is insensitive to the local inconsistency caused by client drift.
- Local inconsistency from client drift primarily affects the generalization error bound rather than convergence speed.
- Applying the relaxed initialization at every training stage helps manage performance degradation on non-identical client datasets.
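A minimal sketch of the plug-in use mentioned above: wrapping an arbitrary existing local-update routine so that it starts from the relaxed initialization rather than from the global model directly. The wrapper name, signature, and coefficient are hypothetical.

```python
from typing import Callable
import numpy as np

def with_relaxed_init(local_update: Callable[[np.ndarray], np.ndarray],
                      w_global: np.ndarray,
                      x_local_last: np.ndarray,
                      beta: float = 0.1) -> np.ndarray:
    """Run any existing local-update routine (FedAvg-, FedProx-, SCAFFOLD-style, ...)
    from the relaxed starting point instead of from w_global directly.
    Names and signature are illustrative, not from the paper."""
    x0 = w_global + beta * (w_global - x_local_last)  # reverse-direction starting point
    return local_update(x0)                           # local training itself is unchanged
```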
Where Pith is reading between the lines
- If the divergence term isolates generalization impact cleanly, the same reverse-direction initialization could be tested in other distributed settings that face data heterogeneity, such as collaborative filtering or multi-site medical models.
- Emphasizing generalization bounds over raw optimization speed when designing federated methods may produce systems that generalize better under real-world non-uniform data partitions.
- Varying the strength of data heterogeneity in experiments would reveal whether the reported generalization gains remain stable or depend on specific drift magnitudes.
- The analysis suggests client drift harms final performance more by encouraging local overfitting than by preventing convergence to a shared optimum.
Load-bearing premise
The assumption that the reverse-direction personalized initialization reliably reduces the negative effects of client drift and that the introduced divergence term in the excess risk analysis accurately isolates its impact on generalization without hidden dependencies on data heterogeneity measures.
What would settle it
A controlled federated learning experiment on synthetic heterogeneous data where the optimization error component of excess risk changes substantially across initialization directions while the generalization component stays constant would falsify the central claim.
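A minimal sketch of such a controlled check on synthetic heterogeneous least-squares clients, assuming final training loss as the proxy for the optimization component and the train-test gap as the proxy for the generalization component; the client construction, the beta grid, and all constants are illustrative choices, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, d, n_train, n_test, rounds, K, lr = 10, 20, 50, 200, 50, 5, 0.05

# Heterogeneous clients: a shared ground truth plus a client-specific shift
# whose scale controls the strength of the drift.
w_true = rng.normal(size=d)
shifts = rng.normal(scale=1.0, size=(n_clients, d))

def make_client(i, n):
    X = rng.normal(size=(n, d))
    y = X @ (w_true + shifts[i]) + 0.1 * rng.normal(size=n)
    return X, y

train = [make_client(i, n_train) for i in range(n_clients)]
test = [make_client(i, n_test) for i in range(n_clients)]

def mse(w, data):
    return float(np.mean([np.mean((X @ w - y) ** 2) for X, y in data]))

def run(beta):
    w_global = np.zeros(d)
    x_last = [np.zeros(d) for _ in range(n_clients)]
    for _ in range(rounds):
        local_states = []
        for i, (X, y) in enumerate(train):
            x = w_global + beta * (w_global - x_last[i])       # relaxed initialization
            for _ in range(K):                                  # K local GD steps
                x = x - lr * (2.0 / len(y)) * X.T @ (X @ x - y)
            x_last[i] = x
            local_states.append(x)
        w_global = np.mean(local_states, axis=0)                # global averaging
    train_loss = mse(w_global, train)                           # optimization proxy
    gen_gap = mse(w_global, test) - train_loss                  # generalization proxy
    return train_loss, gen_gap

for beta in [-0.2, 0.0, 0.1, 0.2]:                              # initialization directions
    train_loss, gen_gap = run(beta)
    print(f"beta={beta:+.1f}  train_loss={train_loss:.3f}  gen_gap={gen_gap:.3f}")
```

Under the central claim, the train-loss column should stay roughly flat across the beta grid while the generalization gap moves; the opposite pattern would be the falsifying outcome described above.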
Original abstract
Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on the heterogeneous dataset. Previous works have implicitly studied that FL suffers from the ``client-drift'' problem, which is caused by the inconsistent optimum across local clients. However, till now it still lacks solid theoretical analysis to explain the impact of this local inconsistency. To alleviate the negative impact of ``client drift'' and explore its substance in FL, in this paper, we first propose an efficient FL algorithm FedInit, which allows employing the personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state towards the reverse direction of the latest local state. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce the excess risk analysis and study the divergence term to investigate the test error in FL. Our studies show that optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound. Extensive experiments are conducted to validate its efficiency. The proposed FedInit method could achieve comparable results compared to several advanced benchmarks without any additional training or communication costs. Meanwhile, the stage-wise personalized relaxed initialization could also be incorporated into several current advanced algorithms to achieve higher generalization performance in the FL paradigm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FedInit, a federated learning algorithm that uses a personalized relaxed initialization: at the start of each local training round, the client model is initialized by moving away from the current global model in the reverse direction of its most recent local state. It introduces an excess-risk decomposition that isolates a divergence term to analyze the effect of local inconsistency (client drift) on test error, claiming that this inconsistency leaves optimization error largely insensitive while primarily inflating the generalization error bound. Experiments on standard FL benchmarks show FedInit matches or exceeds several advanced baselines with no added communication or computation cost and can be plugged into existing methods to improve their generalization.
Significance. If the excess-risk analysis rigorously establishes that the divergence term cleanly separates the impact of local inconsistency from optimization error without residual coupling to heterogeneity measures, the work would supply both a zero-overhead practical heuristic and a useful conceptual distinction between optimization and generalization effects in non-i.i.d. FL. The plug-in compatibility with other algorithms is a concrete strength.
Major comments (1)
- [excess risk analysis] Excess risk analysis section: the central claim that 'optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound' requires an explicit derivation showing that the introduced divergence term is independent of standard heterogeneity quantities (client gradient variance, Wasserstein distance to global optimum, etc.). If the divergence is defined via an identity that still contains these quantities, the claimed separation is not independent and the conclusion that inconsistency 'mainly affects the generalization error bound' becomes circular.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the excess risk analysis below, providing clarification on the separation of terms and committing to revisions where needed to strengthen the presentation.
Point-by-point responses
Referee: Excess risk analysis section: the central claim that 'optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound' requires an explicit derivation showing that the introduced divergence term is independent of standard heterogeneity quantities (client gradient variance, Wasserstein distance to global optimum, etc.). If the divergence is defined via an identity that still contains these quantities, the claimed separation is not independent and the conclusion that inconsistency 'mainly affects the generalization error bound' becomes circular.
Authors: We appreciate this observation and agree that the independence of the divergence term merits an explicit derivation to avoid any perception of circularity. In the excess-risk decomposition (Section 4), the total excess risk is expressed as the sum of an optimization-error term, a generalization-error term, and an additive divergence term that isolates the effect of local inconsistency (client drift), which the personalized relaxed initialization is designed to control. The optimization-error bound follows from standard federated convergence analysis and depends on heterogeneity measures such as client gradient variance; crucially, the divergence term is defined solely via the difference between the local initialization state and the global model at the start of each round, without embedding those heterogeneity quantities in its expression. Consequently, the divergence contributes exclusively to the generalization bound while leaving the optimization-error sensitivity unchanged. To make this separation fully transparent, we will insert a step-by-step derivation in the revised manuscript that explicitly shows the divergence term's independence from the client gradient variance and the Wasserstein distance to the global optimum.
Revision: yes
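For readability, here is an illustrative rendering of the kind of split the response describes; the notation (F for population risk, f for empirical risk, w^T for the final global model, x_i^T for client i's local state, m clients) is assumed here for exposition and is not quoted from the paper.

```latex
% Schematic excess-risk split (notation assumed for exposition, not quoted from the paper).
\begin{align*}
\underbrace{F(w^{T}) - \min_{w} F(w)}_{\text{excess risk}}
  \;\le\;
  \underbrace{f(w^{T}) - \min_{w} f(w)}_{\text{optimization error}}
  \;+\;
  \underbrace{2\,\sup_{w} \bigl| F(w) - f(w) \bigr|}_{\text{generalization error}},
\qquad
\text{divergence} \;:=\; \frac{1}{m}\sum_{i=1}^{m} \bigl\lVert x_{i}^{T} - w^{T} \bigr\rVert^{2}.
\end{align*}
% The claim, as paraphrased in the response above, is that the divergence term
% enters only the bound on the generalization part, not the optimization part.
```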
Circularity Check
No circularity: the excess-risk analysis is presented as an independent study result.
Full rationale
The provided abstract and excerpts describe a new initialization method (FedInit) and an excess-risk decomposition that introduces a divergence term to separate optimization error (claimed insensitive to local inconsistency) from generalization error. No equations, self-citations, or definitions are shown that would make the divergence term reduce by construction to the same heterogeneity quantities it is asserted to isolate. The central claim is framed as an empirical and analytical finding rather than a tautological renaming of its inputs. Absent an explicit reduction (e.g., a divergence term defined via a client-drift identity that re-embeds the target heterogeneity measures), the derivation chain appears self-contained rather than circular.