Pith · machine review for the scientific record

arXiv: 2603.23472 · v2 · submitted 2026-03-24 · 💻 cs.LG · cs.CR · math.OC

Recognition: 2 theorem links · Lean Theorem

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · math.OC
keywords Byzantine robustness · differential privacy · federated learning · convergence analysis · robust aggregation · clipping · double momentum · stochastic optimization

The pith

A new federated algorithm called Byz-Clip21-SGD2M achieves high-probability convergence under L-smoothness and sub-Gaussian noise even with Byzantine attacks and differential privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Byz-Clip21-SGD2M, which combines robust aggregation, double momentum, and clipping to train models across heterogeneous clients while resisting malicious server manipulations and satisfying privacy requirements. It establishes high-probability convergence guarantees using only the standard assumptions that the loss is L-smooth and gradient noise is sub-Gaussian, conditions weaker than the bounded gradients required by most prior analyses. Without attacks the method recovers state-of-the-art rates; with attacks and privacy constraints it delivers stronger utility bounds. This matters because federated systems can now be analyzed under assumptions that hold for many practical losses and noise models rather than artificial restrictions.

Core claim

Byz-Clip21-SGD2M integrates robust aggregation with double momentum and carefully designed clipping. Under L-smoothness and σ-sub-Gaussian gradient noise, the paper proves high-probability convergence guarantees that recover optimal rates in the absence of adversaries and improve utility when Byzantine attacks and differential privacy are present.

What carries the argument

Byz-Clip21-SGD2M, which performs robust aggregation of clipped and double-momentum updates to maintain convergence despite adversarial manipulation and privacy noise.
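The mechanism can be sketched as a generic clip → momentum → robust-aggregate → step loop. This is an editorial simplification, not the paper's algorithm: the actual method uses a double-momentum update and a specific clipping design, and every name below (`tau`, `beta`, `gamma`, `trim`) is a placeholder hyperparameter, not the paper's notation.

```python
import numpy as np

def clip(v, tau):
    """Scale v so its norm is at most tau (standard norm clipping)."""
    n = np.linalg.norm(v)
    return v if n <= tau else (tau / n) * v

def round_update(x, grads, m_prev, tau, beta, gamma, trim=0):
    """One illustrative round: clip per-client gradients, apply a single
    momentum step per client (the paper uses double momentum), aggregate
    with a coordinate-wise trimmed mean, and take a gradient step."""
    clipped = [clip(g, tau) for g in grads]              # per-client clipping
    m = [(1 - beta) * mp + beta * g                      # momentum per client
         for mp, g in zip(m_prev, clipped)]
    M = np.sort(np.stack(m), axis=0)                     # sort per coordinate
    if trim > 0:                                         # drop extremes
        M = M[trim:len(m) - trim]
    agg = M.mean(axis=0)                                 # robust aggregate
    return x - gamma * agg, m
```

With `trim=1` and five clients, one client sending an arbitrarily large vector is first shrunk by clipping and then discarded per-coordinate by the trimmed mean, which is the qualitative behavior the convergence analysis relies on.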

If this is right

  • Convergence rates match the best known rates for non-adversarial federated SGD when no attacks occur.
  • Utility bounds improve over prior Byzantine-robust or DP methods when both attacks and privacy constraints are active.
  • High-probability guarantees apply directly to heterogeneous client data without extra bounded-gradient assumptions.
  • Empirical results on MNIST with CNN and MLP models confirm practical effectiveness under the stated conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other distributed settings where servers are untrusted but data heterogeneity is high.
  • Relaxing bounded-gradient requirements may allow analysis of larger models or non-convex losses common in deep learning.
  • Implementations could combine this clipping and momentum design with existing secure aggregation protocols for end-to-end privacy.
  • The same machinery might yield tighter bounds for other robust aggregation rules beyond the one used here.

Load-bearing premise

The loss function stays L-smooth and the gradient noise remains σ-sub-Gaussian even after Byzantine manipulation and clipping.
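The premise can be probed numerically. The sketch below is an editorial sanity check, not the paper's argument: it clips scalar Gaussian noise at a threshold `tau` and verifies that clipping can only shrink tail probabilities. What it deliberately does not capture is the bias clipping introduces on non-centered samples, which is exactly where the premise could fail under heterogeneous client means.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau, n = 1.0, 3.0, 100_000

z = rng.normal(0.0, sigma, size=n)   # sigma-sub-Gaussian noise (Gaussian case)
z_clip = np.clip(z, -tau, tau)       # scalar analogue of norm clipping

# |clip(z)| <= |z| pointwise, so every tail event for the clipped noise
# is contained in the corresponding event for the raw noise.
for t in (1.0, 2.0, 2.5):
    assert (np.abs(z_clip) > t).mean() <= (np.abs(z) > t).mean()

# The delicate part is bias: clipping a non-centered sample shifts its mean
# toward the threshold, which Byzantine manipulation plus heterogeneity
# could exploit.
shifted = rng.normal(2.5, sigma, size=n)
assert np.clip(shifted, -tau, tau).mean() < shifted.mean()
```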

What would settle it

An explicit loss function that is L-smooth with sub-Gaussian noise for which the algorithm diverges with positive probability under a valid Byzantine attack would disprove the convergence claim.

Figures

Figures reproduced from arXiv: 2603.23472 by Alexander Gaponov, Aurélien Lucchi, Eduard Gorbunov, Grigory Malinovsky, Peter Richtárik, Rustem Islamov.

Figure 1
Figure 1. Performance of Byz-Clip21-SGD2M, Byz-Clip-SGD (Algorithm 3), and Safe-DSHB (Algorithm 4) when training CNN (top line) and MLP (bottom line) models on the MNIST dataset for different numbers of Byzantine clients and privacy budgets, when Byzantine clients use the IPM attack. In Corollary 5.1, the first term corresponds to the error arising from privacy, while the last term reflects the error introduced by the … view at source ↗
Figure 2
Figure 2. Performance of Byz-Clip21-SGD2M, Byz-Clip-SGD (Algorithm 3), and Safe-DSHB (Algorithm 4) when training CNN (top line) and MLP (bottom line) models on the MNIST dataset for different numbers of Byzantine clients and privacy budgets, when Byzantine clients use the label flipping attack. Absence of DP and Byzantine Adversaries. Setting σω = 0 and δbyz = 0 significantly simplifies the choice of hyperparameters. I… view at source ↗
read the original abstract

Federated Learning (FL) enables heterogeneous clients to collaboratively train a shared model without centralizing their raw data, offering an inherent level of privacy. However, gradients and model updates can still leak sensitive information, while malicious servers may mount adversarial attacks such as Byzantine manipulation. These vulnerabilities highlight the need to address differential privacy (DP) and Byzantine robustness within a unified framework. Existing approaches, however, often rely on unrealistic assumptions such as bounded gradients, require auxiliary server-side datasets, or fail to provide convergence guarantees. We address these limitations by proposing Byz-Clip21-SGD2M, a new algorithm that integrates robust aggregation with double momentum and carefully designed clipping. We prove high-probability convergence guarantees under standard $L$-smoothness and $\sigma$-sub-Gaussian gradient noise assumptions, thereby relaxing conditions that dominate prior work. Our analysis recovers state-of-the-art convergence rates in the absence of adversaries and improves utility guarantees under Byzantine and DP settings. Empirical evaluations on CNN and MLP models trained on MNIST further validate the effectiveness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Byz-Clip21-SGD2M, a federated optimization algorithm integrating clipping (for DP and robustness), double momentum, and a robust aggregator. It claims high-probability convergence guarantees under standard L-smoothness and σ-sub-Gaussian gradient noise assumptions, recovering state-of-the-art rates without adversaries while improving utility under Byzantine and DP settings, with validation via CNN/MLP experiments on MNIST.

Significance. If the analysis holds, the work would be significant for relaxing the bounded-gradient assumptions prevalent in prior Byzantine-robust and differentially private federated learning literature, providing a unified framework with explicit high-probability bounds and empirical support on standard benchmarks.

major comments (1)
  1. [Abstract / Theoretical Analysis] The high-probability convergence claims rest on the assumption that gradient noise remains σ-sub-Gaussian after clipping and double-momentum robust aggregation, even under Byzantine clients and heterogeneous data. Clipping truncates tails, and Byzantine manipulation can alter the effective distribution around differing per-client means, so the invariance is not immediate from the stated L-smoothness and sub-Gaussian assumptions alone; explicit justification or a counterexample check is required for the central bounds to apply.
minor comments (2)
  1. [Experiments] Report error bars across multiple runs and specify how the clipping threshold (a free parameter) is selected or tuned, to strengthen the reproducibility of the MNIST results.
  2. [Algorithm Description] Notation: Clarify the exact form of the double-momentum update and the robust aggregator (e.g., coordinate-wise median or trimmed mean) with equation numbers for direct comparison to prior work.
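The two aggregators the referee names can be written compactly. Whether the paper uses either of these exact rules is an assumption here; the sketch only pins down what "robust aggregation" means operationally.

```python
import numpy as np

def coordinate_median(updates):
    """Coordinate-wise median of client update vectors
    (robust to fewer than half the clients being Byzantine)."""
    return np.median(np.stack(updates), axis=0)

def trimmed_mean(updates, k):
    """Coordinate-wise mean after discarding the k largest and k smallest
    values in each coordinate; tolerates up to k Byzantine clients."""
    assert 2 * k < len(updates)
    s = np.sort(np.stack(updates), axis=0)
    return s[k:len(updates) - k].mean(axis=0)

# Four honest clients and one Byzantine client sending a huge vector:
honest = [np.array([1.0, -1.0])] * 4
byzantine = [np.array([1e6, -1e6])]
print(trimmed_mean(honest + byzantine, k=1))   # → [ 1. -1.]
```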

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / Theoretical Analysis] The high-probability convergence claims rest on the assumption that gradient noise remains σ-sub-Gaussian after clipping and double-momentum robust aggregation, even under Byzantine clients and heterogeneous data. Clipping truncates tails, and Byzantine manipulation can alter the effective distribution around differing per-client means, so the invariance is not immediate from the stated L-smoothness and sub-Gaussian assumptions alone; explicit justification or a counterexample check is required for the central bounds to apply.

    Authors: We agree that the preservation of sub-Gaussianity after clipping and robust aggregation is not immediate and merits explicit treatment. Clipping at a threshold proportional to the noise scale preserves sub-Gaussian tails up to a constant factor (by standard concentration arguments for truncated sub-Gaussian random variables), while the double-momentum update and robust aggregator (e.g., coordinate-wise median or trimmed mean) yield an effective noise term whose deviation can be bounded via vector concentration inequalities that continue to hold under per-client heterogeneity. We will insert a new supporting lemma (and its proof) in the appendix that derives the adjusted sub-Gaussian parameter for the aggregated gradient under Byzantine corruption, thereby making the high-probability bounds fully rigorous. This addition does not change the main theorem statements or rates. revision: yes
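The first half of the promised lemma can be made precise with an elementary containment: since clipping can only shrink norms pointwise, tail bounds transfer directly. A hedged sketch, using the common 2·exp(−t²/2σ²) tail convention (the paper's constants may differ):

```latex
% Tail transfer under clipping: a pointwise-domination argument.
\|\mathrm{clip}_\tau(X)\| \le \|X\| \;\text{a.s.}
\;\Longrightarrow\;
\Pr\big(\|\mathrm{clip}_\tau(X)\| \ge t\big)
  \le \Pr\big(\|X\| \ge t\big)
  \le 2\exp\!\left(-\tfrac{t^2}{2\sigma^2}\right).
% What this does NOT cover is the bias term
% \mathbb{E}[\mathrm{clip}_\tau(X)] - \mathbb{E}[X],
% which the appendix lemma must bound separately for non-centered,
% heterogeneous per-client gradients.
```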

Circularity Check

0 steps flagged

No circularity: proof rests on independent standard assumptions

full rationale

The paper states a convergence analysis under L-smoothness and σ-sub-Gaussian gradient noise. No derivation step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation chain. The central claim is a high-probability bound derived from these external assumptions rather than from the algorithm's outputs or prior author results. The skeptic concern about post-clipping noise distribution is a question of assumption validity, not circularity in the derivation itself.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard optimization assumptions plus the effectiveness of the new clipping and momentum mechanisms; no new entities are postulated.

free parameters (2)
  • clipping threshold
    Carefully designed clipping parameter that balances robustness, privacy, and convergence; value is chosen to satisfy the stated utility bounds.
  • momentum coefficients
    Double-momentum parameters that must be set to achieve the claimed acceleration and stability.
axioms (2)
  • domain assumption The objective function is L-smooth
    Invoked throughout the convergence analysis to control gradient differences.
  • domain assumption Gradient noise is σ-sub-Gaussian
    Weaker noise assumption replacing the bounded-gradient condition used in prior Byzantine-robust work.
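In one common formalization (the paper's exact statements and constants may differ), the two axioms read:

```latex
% L-smoothness of the objective:
\|\nabla f(x) - \nabla f(y)\| \;\le\; L\,\|x - y\| \qquad \forall\, x, y.

% \sigma-sub-Gaussian gradient noise (one standard MGF-style formulation):
\mathbb{E}_{\xi}\!\left[
  \exp\!\left( \frac{\|\nabla f(x,\xi) - \nabla f(x)\|^{2}}{\sigma^{2}} \right)
\right] \;\le\; \exp(1) \qquad \forall\, x.
```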

pith-pipeline@v0.9.0 · 5501 in / 1485 out tokens · 47198 ms · 2026-05-14T23:59:36.843344+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages
