arxiv: 2605.06446 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing

Junye Du , Zhenghao Li , Yushi Feng , Long Feng

Authors on Pith no claims yet

Pith reviewed 2026-05-08 12:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords federated learningtransformerattention mechanismclient drifttwo-stage optimizationkernel freezingheterogeneous clientslinear attention

0 comments

The pith

Freezing the query and key blocks after warm-up lets value blocks optimize under a fixed attention kernel, cutting client drift in heterogeneous federated Transformer training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes the attention module so that query and key weights set the kernel while value weights apply transformations under it. It introduces FedFrozen as a two-stage procedure: full-model warm-up followed by freezing the query-key block and continuing to train only the value block. Under a linear-attention model the first stage approximates descent on a regularized kernel-profile objective and the second stage becomes a simpler value-block problem with the kernel held constant. Analysis identifies a direct trade-off controlled by warm-up length, and experiments show the method improves both stability and final performance when data distributions differ across clients.

Core claim

FedFrozen performs full-model warm-up training followed by freezing the query/key block while continuing to optimize the value block. Under a linear-attention formulation the warm-up stage acts as an inexact descent procedure on a regularized kernel-profile objective, while the frozen stage reduces to a restricted value-block optimization problem under a fixed attention kernel. The length of the warm-up governs an explicit trade-off between kernel consistency and value adaptation.

What carries the argument

Decomposition of attention into a query/key block that determines the attention kernel and a value block that performs semantic transformation under that kernel; freezing the query/key block after warm-up to hold the kernel fixed.

If this is right

The warm-up stage functions as inexact descent on a regularized kernel-profile objective.
The frozen stage becomes a restricted optimization of value blocks under a fixed kernel.
Warm-up duration controls a predictable bias-drift trade-off.
Simulations confirm the expected bias-drift behavior.
Real-data runs show improved stability and effectiveness for Transformer models under client heterogeneity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The kernel-value separation could be tested in attention variants beyond the linear case or in other model components that exhibit similar pattern-versus-content structure.
The two-stage schedule might be adapted to non-Transformer architectures in federated settings where some layers set routing or weighting and others perform the actual computation.
The explicit trade-off equation could be used to derive automatic warm-up schedulers rather than manual choices.
Further experiments could check whether the same freezing step helps when heterogeneity arises from label shift rather than feature shift.

Load-bearing premise

The linear-attention formulation captures the essential behavior of real attention mechanisms in heterogeneous federated settings.

What would settle it

Simulations in which varying the warm-up length produces no observable change in the predicted bias-drift trade-off, or real experiments on heterogeneous client data where FedFrozen yields no stability or accuracy gain over standard full-model federated training.

Figures

Figures reproduced from arXiv: 2605.06446 by Junye Du, Long Feng, Yushi Feng, Zhenghao Li.

**Figure 1.** Figure 1: Simulation results for Phase-1 profile descent under different client heterogeneity levels. view at source ↗

**Figure 2.** Figure 2: Simulation results for local-update sensitivity, warm-up selection, and robustness to client view at source ↗

**Figure 3.** Figure 3: (a) Warm-up robustness across heterogeneity levels view at source ↗

read the original abstract

Federated learning with heterogeneous clients remains a significant challenge for deep learning, primarily due to client drift arising from inconsistent local updates. Existing federated optimization methods typically address this issue through objective-level regularization or update-correction mechanisms. Recent studies, however, suggest that Transformer-based architectures may be inherently more robust than conventional models under heterogeneous federated training. Motivated by this observation, we investigate how different parameter components within the attention mechanism influence federated optimization. Specifically, we decompose the attention module into a query/key block, which determines the attention kernel, and a value block, which performs semantic transformation under the induced kernel. Based on this perspective, we propose FedFrozen, a two-stage federated optimization framework that first performs full-model warm-up training and then freezes the query/key block while continuing to optimize the value block. Under a linear-attention formulation, we show that the warm-up stage can be interpreted as an inexact descent procedure on a regularized kernel-profile objective, while the frozen stage reduces to a restricted value-block optimization problem under a fixed attention kernel. Our analysis further reveals an explicit trade-off that governs the choice of warm-up length. Simulations validate the predicted bias-drift behavior, and real-data experiments demonstrate that FedFrozen improves both the stability and effectiveness of Transformer models in heterogeneous federated learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FedFrozen offers a simple two-stage freezing of query/key attention blocks after warm-up to stabilize federated Transformer training, with linear-attention analysis that may not directly explain the softmax experiments.

read the letter

The main thing to know about this paper is that FedFrozen proposes freezing the query and key blocks in attention after an initial warm-up phase during federated training of Transformers. This is intended to mitigate client drift in heterogeneous client settings. The novelty lies in this two-stage approach and the accompanying analysis under a linear-attention model. They show that the warm-up can be seen as inexact descent on a regularized kernel-profile objective, while the frozen stage becomes value-block optimization with a fixed kernel. They also identify a trade-off for choosing the warm-up length. The paper includes simulations for the predicted behavior and real-data experiments claiming better stability and effectiveness. It does well in presenting a structurally simple method that avoids heavy regularization or correction terms. The decomposition of attention into kernel-determining and transformation parts provides a clean way to think about what to freeze. The soft spots center on the linear-attention approximation. Real experiments use standard softmax attention, where the kernel is not fixed in the same manner, and there's no direct verification that the analyzed bias-drift or trade-off holds up. This leaves open whether the improvements are due to the proposed mechanism. The lack of detailed baselines and metrics in the summary also makes it tough to judge the empirical strength. This work is for people in federated learning and distributed optimization who deal with large models. A reader looking for straightforward techniques to improve FL on Transformers would find it worth considering. I recommend sending it for peer review. The idea has potential and the analysis is a reasonable starting point, even with the gaps.

Referee Report

2 major / 2 minor

Summary. The paper proposes FedFrozen, a two-stage federated optimization framework for Transformer models under client heterogeneity. The first stage performs full-model warm-up training; the second freezes the query/key blocks (which determine the attention kernel) while continuing to optimize the value block. Under a linear-attention formulation, the warm-up stage is interpreted as an inexact descent procedure on a regularized kernel-profile objective, while the frozen stage reduces to restricted value-block optimization under a fixed kernel. The analysis yields an explicit trade-off governing warm-up length. Simulations validate the predicted bias-drift behavior, and real-data experiments are claimed to demonstrate improved stability and effectiveness of Transformer models in heterogeneous federated learning.

Significance. If the central claims hold, the work supplies both a practical two-stage procedure that can reduce client drift in federated Transformer training and an interpretable decomposition of attention parameters that explains why freezing the kernel-determining blocks is beneficial. The explicit warm-up-length trade-off is a concrete, actionable contribution. The linear-attention analysis, while limited in scope, offers a clean theoretical lens that could guide future architecture-aware federated methods.

major comments (2)

[§3.2] §3.2 and Eq. (12)–(15): The claim that the frozen stage reduces to a restricted value-block optimization under a fixed attention kernel is derived explicitly from the linear-attention formulation that separates query/key (kernel) from value (transformation). Real-data experiments in §5, however, employ standard softmax attention, for which the kernel remains data-dependent and is not fixed by the same decomposition. No ablation or fidelity check is reported that confirms the predicted bias-drift or warm-up trade-off survives this change, leaving the mechanistic explanation for the observed gains unsupported.
[§5] §5: The manuscript states that real-data experiments demonstrate improvements in stability and effectiveness, yet provides no information on the concrete baselines (FedAvg, FedProx, etc.), evaluation metrics, number of independent runs, or statistical significance tests. Because these details are load-bearing for the empirical support of the central claim, the current presentation does not allow assessment of whether the gains are robust or merely anecdotal.

minor comments (2)

[Introduction] The introduction would benefit from a short paragraph explicitly contrasting the proposed decomposition with prior work on attention parameter roles in federated settings.
[§3.1] Notation for the kernel-profile objective (Eq. (8)) could be clarified by stating whether the regularization term is applied client-wise or globally.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important points about the scope of the theoretical analysis and the presentation of empirical results. We address each major comment below and outline the revisions we will make.

read point-by-point responses

Referee: [§3.2] §3.2 and Eq. (12)–(15): The claim that the frozen stage reduces to a restricted value-block optimization under a fixed attention kernel is derived explicitly from the linear-attention formulation that separates query/key (kernel) from value (transformation). Real-data experiments in §5, however, employ standard softmax attention, for which the kernel remains data-dependent and is not fixed by the same decomposition. No ablation or fidelity check is reported that confirms the predicted bias-drift or warm-up trade-off survives this change, leaving the mechanistic explanation for the observed gains unsupported.

Authors: We agree that the explicit reduction to restricted value-block optimization under a fixed kernel holds only under the linear-attention model used for the analysis in §3.2. This formulation was selected to obtain a clean decomposition and derive the warm-up-length trade-off. The simulations directly validate the predicted bias-drift behavior under the same model. For the real-data experiments with standard softmax attention, the gains are empirical. In the revision we will add a dedicated paragraph in §3 and §5 clarifying that the linear-attention analysis supplies an interpretable mechanistic lens rather than an exact equivalence for softmax, and we will include a new ablation on a controlled task that applies the FedFrozen procedure to softmax attention while tracking the warm-up trade-off. This will strengthen the link between theory and practice. revision: partial
Referee: [§5] §5: The manuscript states that real-data experiments demonstrate improvements in stability and effectiveness, yet provides no information on the concrete baselines (FedAvg, FedProx, etc.), evaluation metrics, number of independent runs, or statistical significance tests. Because these details are load-bearing for the empirical support of the central claim, the current presentation does not allow assessment of whether the gains are robust or merely anecdotal.

Authors: We apologize for the omission of these essential experimental details. In the revised manuscript we will expand §5 with: (i) explicit list of baselines (FedAvg, FedProx, FedAdam, and local-only training), (ii) evaluation metrics (test accuracy and rounds-to-convergence), (iii) number of independent runs (five runs with distinct random seeds), and (iv) statistical significance via paired t-tests with reported p-values and standard deviations. We will also add error bars to all plots and a summary table. These additions will allow readers to assess robustness directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a decomposition of attention into query/key (kernel-determining) and value blocks, proposes the two-stage FedFrozen procedure, and states that under a linear-attention formulation the warm-up stage corresponds to inexact descent on a regularized kernel-profile objective while the frozen stage reduces to value-block optimization under a fixed kernel. This is presented as a derived mathematical interpretation rather than a restatement of inputs by construction. No load-bearing self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are evident from the abstract or context. The analysis remains self-contained against the stated linear-attention model, with separate simulation and real-data validation steps that do not collapse into the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract; the ledger captures the explicit modeling choices mentioned. The warm-up length is presented as governed by an explicit trade-off, implying it functions as a tunable hyperparameter. The linear-attention model is invoked to derive the stage interpretations.

free parameters (1)

warm-up length
The paper states that an explicit trade-off governs the choice of warm-up length, indicating it is selected or tuned rather than derived parameter-free.

axioms (2)

domain assumption Linear-attention formulation accurately represents the behavior of the attention mechanism for the purpose of analyzing federated optimization stages
Invoked to interpret the warm-up stage as inexact descent on a regularized kernel-profile objective and the frozen stage as restricted value-block optimization.
domain assumption Decomposition of attention into query/key block (kernel) and value block (semantic transformation) holds and is useful for federated training
Forms the basis for the two-stage freezing strategy.

pith-pipeline@v0.9.0 · 5537 in / 1602 out tokens · 55937 ms · 2026-05-08T12:42:29.091350+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 8 canonical work pages · 1 internal anchor

[1]

Federated learning based on dynamic regularization.arXiv preprint arXiv:2111.04263, 2021

URL https://arxiv.org/abs/2111.04263. Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Feder- ated learning with personalization layers,

work page arXiv
[2]

Federated Learning with Personalization Layers

URLhttps://arxiv.org/abs/1912.00818. Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Koneˇcn´y, H. Brendan McMa- han, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings,

work page internal anchor Pith review arXiv 1912
[3]

Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou

URL https://arxiv.org/abs/1812.01097. Kevin Clark, Minh-Thang Luong, Quoc V . Le, and Christopher D. Manning. ELECTRA: Pre- training text encoders as discriminators rather than generators. InInternational Conference on Learning Representations,

work page arXiv
[4]

Exploiting shared repre- sentations for personalized federated learning

Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared repre- sentations for personalized federated learning. InInternational Conference on Machine Learning (ICML), pages 2089–2099. PMLR,

2089
[5]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186,

2019
[6]

URLhttps://arxiv.org/abs/1909. 06335. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations,

1909
[7]

Qinbin Li, Bingsheng He, and Dawn Song

URLhttps://arxiv.org/ abs/2211.01572. Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10713–10722,

work page arXiv
[8]

Fedbabu: Towards enhanced representa- tion for federated image classification.arXiv preprint arXiv:2106.06042,

URLhttps://arxiv.org/abs/2106.06042. Liangqiong Qu, Yuyin Zhou, Paul Pu Liang, Yingda Xia, Feifei Wang, Ehsan Adeli, Li Fei-Fei, and Daniel Rubin. Rethinking architecture design for tackling data heterogeneity in federated learning,

work page arXiv
[9]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J

URLhttps://arxiv.org/abs/2106.06047. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67,

work page arXiv
[10]

Adaptive federated optimization,

URLhttps: //arxiv.org/abs/2003.00295. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Infor- mation Processing Systems,

work page arXiv 2003
[11]

Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor

URLhttps://proceedings.neurips.cc/paper_files/paper/2024/file/ b4baac5d3f7508a4eb2b65376470a5a2-Paper-Conference.pdf. Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization.Advances in neural information processing systems, 33:7611–7623,

2024
[12]

11 Appendix In the appendix, we provide the mathematical proofs for all theorems presented in the paper, fol- lowed by additional experimental results

URLhttps://arxiv.org/abs/2408.09101. 11 Appendix In the appendix, we provide the mathematical proofs for all theorems presented in the paper, fol- lowed by additional experimental results. A Proofs A.1 Proof of Theorem 1 Lemma 1.Define𝑍(𝐻;Φ)=𝜑(𝐻𝑊 𝑄 )𝜑(𝐻𝑊 𝐾 )⊤𝐻. Assume Assumption 1 holds. There exists 𝐶1, 𝐶2, 𝐶3, 𝐶4 >0such that: (1.1) ∥𝑍(𝐻;Φ) ∥𝐹 ≤𝐶 1, (1.2...

work page arXiv
[13]

Unless otherwise specified, local training utilizes the AdamW optimizer with a base learning rate of 10 −4 and a weight decay of 0.01

The global model is systematically evaluated on the central test set at the end of every round. Unless otherwise specified, local training utilizes the AdamW optimizer with a base learning rate of 10 −4 and a weight decay of 0.01. Table 5 provides a comprehensive breakdown of these hyperparameter configurations, including algorithm-specific adjustments (e...

2009