SP-CACW: Convergence-Aware Client Weighting for Selfish Personalized Learning

Kfir Y. Levy; Yaron Kiselman

arxiv: 2606.29322 · v1 · pith:4FS5DW4Wnew · submitted 2026-06-28 · 💻 cs.LG

SP-CACW: Convergence-Aware Client Weighting for Selfish Personalized Learning

Yaron Kiselman , Kfir Y. Levy This is my paper

Pith reviewed 2026-06-30 07:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords federated learningpersonalized learningclient weightingconvergence analysisnegative transferselfish optimizationaggregation weights

0 comments

The pith

A target client in collaborative learning can weight peer gradients by minimizing an upper bound on its own convergence error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies selfish personalization, where one designated client wants to benefit from peer gradients without negative transfer from mismatched data distributions. It proposes SP-CACW, which chooses aggregation weights to minimize an upper bound on the target client's convergence error. The bound balances bias from dissimilar peers against stochastic gradient variance and permits zero weight on harmful peers. Convergence guarantees are given under smoothness and bounded-variance assumptions. On MNIST, CIFAR-100 and LEAF Shakespeare the method matches or exceeds strong personalized and clustering baselines.

Core claim

SP-CACW selects aggregation weights by minimizing an upper bound on the target client's convergence error. This rule explicitly trades off peer bias against stochastic variance, can assign zero weight to harmful peers, and provides convergence guarantees under smoothness and bounded-variance assumptions.

What carries the argument

The convergence-aware client weighting rule that minimizes an upper bound on the target client's convergence error

If this is right

The weighting rule can assign zero weight to peers whose data would increase the target's error.
Convergence is guaranteed for the target client under standard smoothness and bounded-variance assumptions.
The method is competitive with or improves over strong personalized and clustering baselines on image and text datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bound-minimization idea could be applied to other distributed optimization settings where each participant has a private objective.
Early exclusion of peers via zero weights might reduce communication cost without harming the target client.
The approach might complement or replace clustering methods when data heterogeneity is client-specific rather than group-based.

Load-bearing premise

The upper bound on convergence error is a tight enough proxy that minimizing it improves the target client's actual risk.

What would settle it

An experiment in which the weights chosen by minimizing the bound produce higher test error on the target client's data than uniform averaging or a simple baseline.

read the original abstract

Collaborative learning is sustainable only when it benefits each participant. Standard federated learning optimizes a global average objective, which can under perform for clients whose data distributions differ substantially from the population. We study selfish personalization: how a designated target client can use peer gradients to minimize its own risk while avoiding negative transfer. We propose SP-CACW, a convergence-aware client-weighting framework that selects aggregation weights by minimizing an upper bound on the target client's convergence error. The resulting rule explicitly trades off peer bias against stochastic variance and can assign zero weight to harmful peers. We provide convergence guarantees under smoothness and bounded-variance assumptions and evaluate the method on MNIST, CIFAR-100, and LEAF Shakespeare, where it is competitive with or improves over strong personalized and clustering baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SP-CACW derives client weights for one target by minimizing a convergence-error upper bound that trades bias against variance, but offers no check on whether that bound tracks real risk.

read the letter

The core move is to let a designated client pick aggregation weights by minimizing an explicit upper bound on its own convergence error, built from smoothness and bounded-variance assumptions. The rule can set harmful peers to zero weight and comes with standard convergence guarantees.

What is new is the selfish framing: the weights are chosen solely for one client's convergence bound rather than a global or clustered objective. The bias-variance tradeoff inside the bound is written out clearly, which is cleaner than many heuristic weighting schemes in the personalized FL literature. The experiments on MNIST, CIFAR-100, and LEAF Shakespeare show results that are at least competitive with existing personalization and clustering baselines.

The soft spot is exactly the one the stress-test note flags. The paper never checks whether the bound is tight enough or correlates with the client's actual risk; the guarantees therefore apply only to the proxy. If the constants are loose or the bias/variance estimates from data are noisy, the argmin of the bound can differ from the argmin of true risk, and the empirical gains could be incidental. No diagnostic on bound quality appears in the abstract or method description.

This is for people already working on client weighting or negative transfer in heterogeneous federated learning. A reader who wants an explicit derivation rather than another regularizer will find something usable here. The work is coherent on its own terms and has enough formal and empirical content to justify sending it to referees, though the bound-tightness question will need attention in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes SP-CACW, a convergence-aware client-weighting method for selfish personalized federated learning. A target client selects aggregation weights for peer gradients by minimizing an upper bound on its own convergence error (constructed from smoothness and bounded-variance assumptions), explicitly trading off peer bias against stochastic variance and permitting zero weight on harmful peers. Convergence guarantees are stated under standard assumptions, and the method is evaluated empirically on MNIST, CIFAR-100, and LEAF Shakespeare, where it is competitive with or outperforms strong personalized and clustering baselines.

Significance. If the central claim holds, the work supplies a theoretically motivated weighting rule that lets individual clients avoid negative transfer while retaining convergence guarantees. The explicit bias-variance trade-off and zero-weight capability are concrete strengths; the provision of convergence guarantees under smoothness and bounded-variance assumptions, together with multi-dataset empirical comparisons, adds value for heterogeneous FL settings.

major comments (2)

[§3] §3 (proposed weighting rule) and the paragraph following Eq. (bound expression): the central claim that the argmin of the upper bound produces weights that reduce the target client's true risk (rather than merely the bound) is load-bearing, yet the manuscript provides no analysis of bound tightness, no correlation study between bound value and observed risk, and no comparison of the bound minimizer against the true-risk minimizer. This directly affects whether the rule reliably avoids negative transfer.
[Theorem 1] Theorem 1 (convergence guarantee): the stated rate applies to the sequence generated by the bound-minimizing weights, but without a tightness or approximation-quality result relating the bound to the actual excess risk, it is unclear whether the guarantee transfers to the quantity the client ultimately optimizes.

minor comments (2)

[§3] Notation for bias and variance terms in the bound is introduced without an explicit table or appendix listing all constants and their dependence on data heterogeneity; a compact summary would improve readability.
[Experiments] Figure captions for the empirical results do not state the number of independent runs or report standard deviations; adding these details would strengthen the comparison claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of the bias-variance trade-off and zero-weight mechanism in SP-CACW. We respond to each major comment below and outline planned revisions.

read point-by-point responses

Referee: [§3] §3 (proposed weighting rule) and the paragraph following Eq. (bound expression): the central claim that the argmin of the upper bound produces weights that reduce the target client's true risk (rather than merely the bound) is load-bearing, yet the manuscript provides no analysis of bound tightness, no correlation study between bound value and observed risk, and no comparison of the bound minimizer against the true-risk minimizer. This directly affects whether the rule reliably avoids negative transfer.

Authors: We agree that the manuscript lacks an explicit tightness analysis, correlation study, or direct comparison between the bound minimizer and the true-risk minimizer. The upper bound is constructed from standard smoothness and bounded-variance assumptions, and its minimization yields a closed-form weighting rule that trades off bias and variance while permitting zero weights. However, without additional analysis it remains an open question how closely the bound tracks true excess risk on real data. In the revised manuscript we will add a dedicated paragraph after the bound derivation discussing the conditions (e.g., when stochastic variance dominates or when client distributions are moderately heterogeneous) under which the bound is expected to be reasonably tight, together with a brief remark on the implications for negative-transfer avoidance. We will also include a small synthetic-data experiment illustrating the correlation between bound value and observed risk if space allows. This constitutes a partial revision. revision: partial
Referee: [Theorem 1] Theorem 1 (convergence guarantee): the stated rate applies to the sequence generated by the bound-minimizing weights, but without a tightness or approximation-quality result relating the bound to the actual excess risk, it is unclear whether the guarantee transfers to the quantity the client ultimately optimizes.

Authors: Theorem 1 establishes a convergence rate for the excess risk of the target client when the aggregation weights are chosen to minimize the derived upper bound; because the bound upper-bounds the one-step descent term, the rate holds for that particular sequence under the stated assumptions. We acknowledge that the theorem does not quantify how loose the bound may be or provide an approximation guarantee linking the bound minimizer to the true-risk minimizer. In the revision we will insert a clarifying remark immediately after the theorem statement that explicitly notes the guarantee applies to the bound-minimizing sequence and therefore controls true excess risk only up to the (unspecified) looseness of the bound. This clarification can be added without altering the proof or requiring new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation from error bound is independent

full rationale

The paper derives client weights by minimizing a theoretically obtained upper bound on convergence error, constructed from standard smoothness and bounded-variance assumptions. This follows a conventional bounding-plus-optimization approach in federated learning analysis and does not reduce any claimed result to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. Convergence guarantees are stated to hold for the bound under the listed assumptions; no equations or steps are shown to be equivalent to their inputs by construction. The method is self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard smoothness and bounded-variance assumptions plus the unverified premise that the derived upper bound is a useful surrogate for actual target-client risk.

axioms (1)

domain assumption Gradients satisfy smoothness and have bounded variance
Explicitly invoked for the convergence guarantees stated in the abstract.

pith-pipeline@v0.9.1-grok · 5660 in / 1237 out tokens · 28358 ms · 2026-06-30T07:56:20.797992+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Filip Hanzely and Peter Richtárik

doi: 10.1109/TIT.2022.3192506. Filip Hanzely and Peter Richtárik. Federated Learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516, 2020. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007. Kun Hu, Sheng Gong, Qiang Zhang, Chao Seng, Me...

work page doi:10.1109/tit.2022.3192506 2022
[2]

Federated Learning with Non-IID Data

doi: 10.1162/NECO.1994.6.1.147. Karen Simonyan and Andrew Zisserman. V ery deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR) , 2015. Jiayi Tan, Y ang Zhou, Guang Liu, Jie-Hong Wang, and Shun Y u. pfedsim: Similarity-aware model aggregation towards personalized federated learning, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/neco.1994.6.1.147 1994
[3]

Applying the bounded noise assumption: V ar(Gt(α)) = X i∈[M ] (α(i))2Etkξ(i) t k2 X i∈[M ] (α(i))2(σ(i))2

Variance Term: Assuming independence of the stochastic noise across clients, the variance of the weighted sum is the sum of the individual variances. Applying the bounded noise assumption: V ar(Gt(α)) = X i∈[M ] (α(i))2Etkξ(i) t k2 X i∈[M ] (α(i))2(σ(i))2
[4]

TX t=1 kBαtk2 # 2E

Expectation Term: Because the noise is zero-mean ( Et[ξ(i) t ] = 0 ) and the weights reside on the probability simplex (P i α(i) = 1), the target gradient factors out cleanly: Et[Gt(α)] = X i∈[M ] α(i) rf (0)(wt) + b(i) t =rf (0)(wt) + X i∈[M ] α(i)b(i) t Applying the elementary inequalitykx + yk2 2kxk2 + 2kyk2, we bound the squared norm of this expectati...

2023
[5]

Target Cluster Size: We focus on a client in the target cluster I0, so we set ni = n0
[6]

No Malicious Clients: We set βi = 0, which eliminates the robustness term βiσ∆
[7]

Substituting these values into the general theorem yields: errFC T ≲ r σ2/n0 + σ3/∆ T = s σ2 n0T + σ3 ∆T

Simpliﬁcation: We treat constants A as O(1). Substituting these values into the general theorem yields: errFC T ≲ r σ2/n0 + σ3/∆ T = s σ2 n0T + σ3 ∆T . To establish an idealized baseline, this lemma applies the general bound from Theorem 4.2, though the formal proof of that theorem is deferred to Appendix D.3. Readers may prefer to complete the appendix t...
[8]

By standard matrix calculus for quadratic forms, the gradient is: rLt(α) = 2 Htα and the Hessian is the constant matrix: r2Lt(α) = 2 Ht

Formulating the Exp-Concavity Condition To prove our function Lt(α) is γ-exp-concave, we compute its ﬁrst and second derivatives. By standard matrix calculus for quadratic forms, the gradient is: rLt(α) = 2 Htα and the Hessian is the constant matrix: r2Lt(α) = 2 Ht. Substituting these into the deﬁnition, we require: 2Ht γ(2Htα)(2Htα)⊤ = 4γHtαα⊤Ht
[9]

Projecting both sides onto v yields: 2(v⊤Htv) 4γ(v⊤Htα)(α⊤Htv)

V eriﬁcation via Induced Inner Products By the deﬁnition of the Loewner order, A B if and only if v⊤Av v⊤Bv for any arbitrary vector v2 RM . Projecting both sides onto v yields: 2(v⊤Htv) 4γ(v⊤Htα)(α⊤Htv). 20 Because Ht is symmetric and PSD, it deﬁnes a valid induced inner product spacehx, yiH = x⊤Hty with squared normkxk2 H = x⊤Htx. Rewriting our inequali...
[10]

Follow the Leader

Bounding the Maximum Loss and Gradient BecauseLt(α) is convex over the probability simplex ∆M , its maximum occurs at a vertex. To rigorously bound this maximum, we introduce the explicit assumption that the local gradients and true gradients are almost surely bounded such that kg(i) t k gmax andkrf (i)(wt)k gmax for all i. Evaluating the loss at a vertex...
[11]

Combining this with Lemma D.1 (which provides the bound on the gradient norm GL of the loss function) completes the proof. E Extension to Partial Participation While the primary analysis of SP-CACW is presented under the assumption of full participation, the framework naturally extends to partial participation settings via a standard mathematical reductio...
[12]

Unbiased-ness: EZ[ ˜Gt(αt)] =PM −1 i=0 α(i) t g(i) t
[13]

V ariance Bound: V arZ( ˜Gt(αt)) = 1 p 1 PM −1 i=0 (α(i) t )2kg(i) t k2 where EZ and V arZ denote the expectation and variance over the sampling indicators zi,t. Proof. 1. Unbiased-ness: By linearity of expectation and the deﬁnition of the Bernoulli variable ( EZ[zi,t] = p): EZ[ ˜Gt(αt)] = M −1X i=0 α(i) t EZ " g(i) t p zi,t # = M −1X i=0 α(i) t g(i) t
[14]

Variance Bound: By deﬁnition, the variance of the sum of independent random variables is the sum of their variances. Since the client sampling indicators are independent across the network, we have: V arZ( ˜Gt(αt)) = M −1X i=0 (α(i) t )2V arZ g(i) t p zi,t ! 23 For a Bernoulli random variable zi,t with parameter p, the variance is V arZ(zi,t) = p(1p). Pul...
[15]

Cubic Regularization. To prevent the algorithm from prematurely collapsing into a trivial solu- tion (e.g., placing probability 1 on the target client or a single peer), we introduce a regularization term. We select a cubic penalty P(α(i))3 rather than the standard quadratic ( L2) penalty. Cubic regularization provides a ﬂatter gradient near zero, allowin...
[16]

To maintain computational efﬁciency, we approximate the spectral norm of the Hessian via Power Iteration, computing Hessian-vector products using the method of Pearlmutter [1994]

Lipchitz Constant Estimation: The smoothness constant L is estimated dynamically us- ing local curvature information. To maintain computational efﬁciency, we approximate the spectral norm of the Hessian via Power Iteration, computing Hessian-vector products using the method of Pearlmutter [1994]. This avoids materializing the full Hessian matrix while pro...

1994
[17]

We set λreg = γ k var(g(0) t )k2 (where γ = 1 ), en- suring the penalty naturally decays as the model approaches convergence and the gradient magnitude diminishes

Dynamic Regularization: The regularization coefﬁcient λreg is adaptive, scaling with the signal strength of the target client. We set λreg = γ k var(g(0) t )k2 (where γ = 1 ), en- suring the penalty naturally decays as the model approaches convergence and the gradient magnitude diminishes
[18]

This allows us to optimize the objective while strictly enforcing the simplex constraint P α(i) = 1, α(i) 0 via Euclidean projection

Numerical Solver (PGD): To efﬁciently ﬁnd the optimal weights α at each step, we em- ploy Projected Gradient Descent (PGD) . This allows us to optimize the objective while strictly enforcing the simplex constraint P α(i) = 1, α(i) 0 via Euclidean projection. 24
[19]

In our exper- iments, we set K to correspond to one full pass over the local data (one epoch)

Lazy Weight Updates (Epoch-Based): To improve stability and reduce computational cost, we update the aggregation weights αt only once every K iterations. In our exper- iments, we set K to correspond to one full pass over the local data (one epoch). This strategy allows the bias estimates to accumulate sufﬁcient statistics from the entire data dis- tributi...
[20]

Ccomm = O(d)|{z} Unlink Broadcast + O(M d)| {z } Down link Reception (Line 3) =O(M d)

Communication Cost ( Ccomm): The only network exchange occurs when the server broadcasts the model and receives the local stochastic gradients g(i) t from the active peers (Line 3). Ccomm = O(d)|{z} Unlink Broadcast + O(M d)| {z } Down link Reception (Line 3) =O(M d)
[21]

primary signal

Computational Cost ( Ccomp): Once the local gradients are received, all subsequent personalization and tracking mechanics are executed entirely in-memory via linear vector operations. The computation breaks down into the bias EMA tracking (Line 4), the weighted gradient aggregation (Line 5), the model update (Line 6), and the FTAL weight optimization via ...

2015
[22]

Intra-Cluster Heterogeneity: It varies the distributions of the primary classes, ensuring clients in the same cluster are not identical. 27
[23]

background noise

Inter-Cluster Leakage: It introduces small, non-zero probabilities for classes outside the client’s assigned cluster. This breaks the strict disjoint assumption, creating a "background noise" level that tests the algorithm’s robustness to irrelevant data. Algorithm 3 CIFAR-100 Partitioning (Latent Clusters) Require: N (Clients), C (Clusters), M (Clients p...

[1] [1]

Filip Hanzely and Peter Richtárik

doi: 10.1109/TIT.2022.3192506. Filip Hanzely and Peter Richtárik. Federated Learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516, 2020. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007. Kun Hu, Sheng Gong, Qiang Zhang, Chao Seng, Me...

work page doi:10.1109/tit.2022.3192506 2022

[2] [2]

Federated Learning with Non-IID Data

doi: 10.1162/NECO.1994.6.1.147. Karen Simonyan and Andrew Zisserman. V ery deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR) , 2015. Jiayi Tan, Y ang Zhou, Guang Liu, Jie-Hong Wang, and Shun Y u. pfedsim: Similarity-aware model aggregation towards personalized federated learning, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/neco.1994.6.1.147 1994

[3] [3]

Applying the bounded noise assumption: V ar(Gt(α)) = X i∈[M ] (α(i))2Etkξ(i) t k2 X i∈[M ] (α(i))2(σ(i))2

Variance Term: Assuming independence of the stochastic noise across clients, the variance of the weighted sum is the sum of the individual variances. Applying the bounded noise assumption: V ar(Gt(α)) = X i∈[M ] (α(i))2Etkξ(i) t k2 X i∈[M ] (α(i))2(σ(i))2

[4] [4]

TX t=1 kBαtk2 # 2E

Expectation Term: Because the noise is zero-mean ( Et[ξ(i) t ] = 0 ) and the weights reside on the probability simplex (P i α(i) = 1), the target gradient factors out cleanly: Et[Gt(α)] = X i∈[M ] α(i) rf (0)(wt) + b(i) t =rf (0)(wt) + X i∈[M ] α(i)b(i) t Applying the elementary inequalitykx + yk2 2kxk2 + 2kyk2, we bound the squared norm of this expectati...

2023

[5] [5]

Target Cluster Size: We focus on a client in the target cluster I0, so we set ni = n0

[6] [6]

No Malicious Clients: We set βi = 0, which eliminates the robustness term βiσ∆

[7] [7]

Substituting these values into the general theorem yields: errFC T ≲ r σ2/n0 + σ3/∆ T = s σ2 n0T + σ3 ∆T

Simpliﬁcation: We treat constants A as O(1). Substituting these values into the general theorem yields: errFC T ≲ r σ2/n0 + σ3/∆ T = s σ2 n0T + σ3 ∆T . To establish an idealized baseline, this lemma applies the general bound from Theorem 4.2, though the formal proof of that theorem is deferred to Appendix D.3. Readers may prefer to complete the appendix t...

[8] [8]

By standard matrix calculus for quadratic forms, the gradient is: rLt(α) = 2 Htα and the Hessian is the constant matrix: r2Lt(α) = 2 Ht

Formulating the Exp-Concavity Condition To prove our function Lt(α) is γ-exp-concave, we compute its ﬁrst and second derivatives. By standard matrix calculus for quadratic forms, the gradient is: rLt(α) = 2 Htα and the Hessian is the constant matrix: r2Lt(α) = 2 Ht. Substituting these into the deﬁnition, we require: 2Ht γ(2Htα)(2Htα)⊤ = 4γHtαα⊤Ht

[9] [9]

Projecting both sides onto v yields: 2(v⊤Htv) 4γ(v⊤Htα)(α⊤Htv)

V eriﬁcation via Induced Inner Products By the deﬁnition of the Loewner order, A B if and only if v⊤Av v⊤Bv for any arbitrary vector v2 RM . Projecting both sides onto v yields: 2(v⊤Htv) 4γ(v⊤Htα)(α⊤Htv). 20 Because Ht is symmetric and PSD, it deﬁnes a valid induced inner product spacehx, yiH = x⊤Hty with squared normkxk2 H = x⊤Htx. Rewriting our inequali...

[10] [10]

Follow the Leader

Bounding the Maximum Loss and Gradient BecauseLt(α) is convex over the probability simplex ∆M , its maximum occurs at a vertex. To rigorously bound this maximum, we introduce the explicit assumption that the local gradients and true gradients are almost surely bounded such that kg(i) t k gmax andkrf (i)(wt)k gmax for all i. Evaluating the loss at a vertex...

[11] [11]

Combining this with Lemma D.1 (which provides the bound on the gradient norm GL of the loss function) completes the proof. E Extension to Partial Participation While the primary analysis of SP-CACW is presented under the assumption of full participation, the framework naturally extends to partial participation settings via a standard mathematical reductio...

[12] [12]

Unbiased-ness: EZ[ ˜Gt(αt)] =PM −1 i=0 α(i) t g(i) t

[13] [13]

V ariance Bound: V arZ( ˜Gt(αt)) = 1 p 1 PM −1 i=0 (α(i) t )2kg(i) t k2 where EZ and V arZ denote the expectation and variance over the sampling indicators zi,t. Proof. 1. Unbiased-ness: By linearity of expectation and the deﬁnition of the Bernoulli variable ( EZ[zi,t] = p): EZ[ ˜Gt(αt)] = M −1X i=0 α(i) t EZ " g(i) t p zi,t # = M −1X i=0 α(i) t g(i) t

[14] [14]

Variance Bound: By deﬁnition, the variance of the sum of independent random variables is the sum of their variances. Since the client sampling indicators are independent across the network, we have: V arZ( ˜Gt(αt)) = M −1X i=0 (α(i) t )2V arZ g(i) t p zi,t ! 23 For a Bernoulli random variable zi,t with parameter p, the variance is V arZ(zi,t) = p(1p). Pul...

[15] [15]

Cubic Regularization. To prevent the algorithm from prematurely collapsing into a trivial solu- tion (e.g., placing probability 1 on the target client or a single peer), we introduce a regularization term. We select a cubic penalty P(α(i))3 rather than the standard quadratic ( L2) penalty. Cubic regularization provides a ﬂatter gradient near zero, allowin...

[16] [16]

To maintain computational efﬁciency, we approximate the spectral norm of the Hessian via Power Iteration, computing Hessian-vector products using the method of Pearlmutter [1994]

Lipchitz Constant Estimation: The smoothness constant L is estimated dynamically us- ing local curvature information. To maintain computational efﬁciency, we approximate the spectral norm of the Hessian via Power Iteration, computing Hessian-vector products using the method of Pearlmutter [1994]. This avoids materializing the full Hessian matrix while pro...

1994

[17] [17]

We set λreg = γ k var(g(0) t )k2 (where γ = 1 ), en- suring the penalty naturally decays as the model approaches convergence and the gradient magnitude diminishes

Dynamic Regularization: The regularization coefﬁcient λreg is adaptive, scaling with the signal strength of the target client. We set λreg = γ k var(g(0) t )k2 (where γ = 1 ), en- suring the penalty naturally decays as the model approaches convergence and the gradient magnitude diminishes

[18] [18]

This allows us to optimize the objective while strictly enforcing the simplex constraint P α(i) = 1, α(i) 0 via Euclidean projection

Numerical Solver (PGD): To efﬁciently ﬁnd the optimal weights α at each step, we em- ploy Projected Gradient Descent (PGD) . This allows us to optimize the objective while strictly enforcing the simplex constraint P α(i) = 1, α(i) 0 via Euclidean projection. 24

[19] [19]

In our exper- iments, we set K to correspond to one full pass over the local data (one epoch)

Lazy Weight Updates (Epoch-Based): To improve stability and reduce computational cost, we update the aggregation weights αt only once every K iterations. In our exper- iments, we set K to correspond to one full pass over the local data (one epoch). This strategy allows the bias estimates to accumulate sufﬁcient statistics from the entire data dis- tributi...

[20] [20]

Ccomm = O(d)|{z} Unlink Broadcast + O(M d)| {z } Down link Reception (Line 3) =O(M d)

Communication Cost ( Ccomm): The only network exchange occurs when the server broadcasts the model and receives the local stochastic gradients g(i) t from the active peers (Line 3). Ccomm = O(d)|{z} Unlink Broadcast + O(M d)| {z } Down link Reception (Line 3) =O(M d)

[21] [21]

primary signal

Computational Cost ( Ccomp): Once the local gradients are received, all subsequent personalization and tracking mechanics are executed entirely in-memory via linear vector operations. The computation breaks down into the bias EMA tracking (Line 4), the weighted gradient aggregation (Line 5), the model update (Line 6), and the FTAL weight optimization via ...

2015

[22] [22]

Intra-Cluster Heterogeneity: It varies the distributions of the primary classes, ensuring clients in the same cluster are not identical. 27

[23] [23]

background noise

Inter-Cluster Leakage: It introduces small, non-zero probabilities for classes outside the client’s assigned cluster. This breaks the strict disjoint assumption, creating a "background noise" level that tests the algorithm’s robustness to irrelevant data. Algorithm 3 CIFAR-100 Partitioning (Latent Clusters) Require: N (Clients), C (Clusters), M (Clients p...