Pith · machine review for the scientific record

arXiv: 2603.23472 · v2 · submitted 2026-03-24 · 💻 cs.LG · cs.CR · math.OC

Recognition: 2 theorem links · Lean Theorem

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · math.OC
keywords Byzantine robustness · differential privacy · federated learning · convergence analysis · robust aggregation · clipping · double momentum · stochastic optimization

The pith

A new federated algorithm called Byz-Clip21-SGD2M achieves high-probability convergence under L-smoothness and sub-Gaussian noise even with Byzantine attacks and differential privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Byz-Clip21-SGD2M, which combines robust aggregation, double momentum, and clipping to train models across heterogeneous clients while resisting malicious server manipulations and satisfying privacy requirements. It establishes high-probability convergence guarantees using only the standard assumptions that the loss is L-smooth and gradient noise is sub-Gaussian, conditions weaker than the bounded gradients required by most prior analyses. Without attacks the method recovers state-of-the-art rates; with attacks and privacy constraints it delivers stronger utility bounds. This matters because federated systems can now be analyzed under assumptions that hold for many practical losses and noise models rather than artificial restrictions.

Core claim

Byz-Clip21-SGD2M integrates robust aggregation with double momentum and carefully designed clipping. Under L-smoothness and σ-sub-Gaussian gradient noise, the paper proves high-probability convergence guarantees that recover optimal rates in the absence of adversaries and improve utility when Byzantine attacks and differential privacy are present.

What carries the argument

Byz-Clip21-SGD2M, which performs robust aggregation of clipped and double-momentum updates to maintain convergence despite adversarial manipulation and privacy noise.
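The mechanism can be sketched as a generic clip → momentum → robust-aggregate → step loop. This is an editorial simplification, not the paper's algorithm: the actual method uses a double-momentum update and a specific clipping design, and every name below (`tau`, `beta`, `gamma`, `trim`) is a placeholder hyperparameter, not the paper's notation.

```python
import numpy as np

def clip(v, tau):
    """Scale v so its norm is at most tau (standard norm clipping)."""
    n = np.linalg.norm(v)
    return v if n <= tau else (tau / n) * v

def round_update(x, grads, m_prev, tau, beta, gamma, trim=0):
    """One illustrative round: clip per-client gradients, apply a single
    momentum step per client (the paper uses double momentum), aggregate
    with a coordinate-wise trimmed mean, and take a gradient step."""
    clipped = [clip(g, tau) for g in grads]              # per-client clipping
    m = [(1 - beta) * mp + beta * g                      # momentum per client
         for mp, g in zip(m_prev, clipped)]
    M = np.sort(np.stack(m), axis=0)                     # sort per coordinate
    if trim > 0:                                         # drop extremes
        M = M[trim:len(m) - trim]
    agg = M.mean(axis=0)                                 # robust aggregate
    return x - gamma * agg, m
```

With `trim=1` and five clients, one client sending an arbitrarily large vector is first shrunk by clipping and then discarded per-coordinate by the trimmed mean, which is the qualitative behavior the convergence analysis relies on.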

If this is right

  • Convergence rates match the best known rates for non-adversarial federated SGD when no attacks occur.
  • Utility bounds improve over prior Byzantine-robust or DP methods when both attacks and privacy constraints are active.
  • High-probability guarantees apply directly to heterogeneous client data without extra bounded-gradient assumptions.
  • Empirical results on MNIST with CNN and MLP models confirm practical effectiveness under the stated conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other distributed settings where servers are untrusted but data heterogeneity is high.
  • Relaxing bounded-gradient requirements may allow analysis of larger models or non-convex losses common in deep learning.
  • Implementations could combine this clipping and momentum design with existing secure aggregation protocols for end-to-end privacy.
  • The same machinery might yield tighter bounds for other robust aggregation rules beyond the one used here.

Load-bearing premise

The loss function stays L-smooth and the gradient noise remains σ-sub-Gaussian even after Byzantine manipulation and clipping.
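The premise can be probed numerically. The sketch below is an editorial sanity check, not the paper's argument: it clips scalar Gaussian noise at a threshold `tau` and verifies that clipping can only shrink tail probabilities. What it deliberately does not capture is the bias clipping introduces on non-centered samples, which is exactly where the premise could fail under heterogeneous client means.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau, n = 1.0, 3.0, 100_000

z = rng.normal(0.0, sigma, size=n)   # sigma-sub-Gaussian noise (Gaussian case)
z_clip = np.clip(z, -tau, tau)       # scalar analogue of norm clipping

# |clip(z)| <= |z| pointwise, so every tail event for the clipped noise
# is contained in the corresponding event for the raw noise.
for t in (1.0, 2.0, 2.5):
    assert (np.abs(z_clip) > t).mean() <= (np.abs(z) > t).mean()

# The delicate part is bias: clipping a non-centered sample shifts its mean
# toward the threshold, which Byzantine manipulation plus heterogeneity
# could exploit.
shifted = rng.normal(2.5, sigma, size=n)
assert np.clip(shifted, -tau, tau).mean() < shifted.mean()
```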

What would settle it

An explicit loss function that is L-smooth with sub-Gaussian noise for which the algorithm diverges with positive probability under a valid Byzantine attack would disprove the convergence claim.

Figures

Figures reproduced from arXiv: 2603.23472 by Alexander Gaponov, Aurélien Lucchi, Eduard Gorbunov, Grigory Malinovsky, Peter Richtárik, Rustem Islamov.

Figure 1
Figure 1. Performance of Byz-Clip21-SGD2M, Byz-Clip-SGD (Algorithm 3), and Safe-DSHB (Algorithm 4) when training CNN (top line) and MLP (bottom line) models on the MNIST dataset for different numbers of Byzantine clients and privacy budgets, when Byzantine clients use the IPM attack. In Corollary 5.1, the first term corresponds to the error arising from privacy, while the last term reflects the error introduced by the … view at source ↗
Figure 2
Figure 2. Performance of Byz-Clip21-SGD2M, Byz-Clip-SGD (Algorithm 3), and Safe-DSHB (Algorithm 4) when training CNN (top line) and MLP (bottom line) models on the MNIST dataset for different numbers of Byzantine clients and privacy budgets, when Byzantine clients use the label flipping attack. Absence of DP and Byzantine Adversaries. Setting σω = 0 and δbyz = 0 significantly simplifies the choice of hyperparameters. I… view at source ↗
read the original abstract

Federated Learning (FL) enables heterogeneous clients to collaboratively train a shared model without centralizing their raw data, offering an inherent level of privacy. However, gradients and model updates can still leak sensitive information, while malicious servers may mount adversarial attacks such as Byzantine manipulation. These vulnerabilities highlight the need to address differential privacy (DP) and Byzantine robustness within a unified framework. Existing approaches, however, often rely on unrealistic assumptions such as bounded gradients, require auxiliary server-side datasets, or fail to provide convergence guarantees. We address these limitations by proposing Byz-Clip21-SGD2M, a new algorithm that integrates robust aggregation with double momentum and carefully designed clipping. We prove high-probability convergence guarantees under standard $L$-smoothness and $\sigma$-sub-Gaussian gradient noise assumptions, thereby relaxing conditions that dominate prior work. Our analysis recovers state-of-the-art convergence rates in the absence of adversaries and improves utility guarantees under Byzantine and DP settings. Empirical evaluations on CNN and MLP models trained on MNIST further validate the effectiveness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Byz-Clip21-SGD2M, a federated optimization algorithm integrating clipping (for DP and robustness), double momentum, and a robust aggregator. It claims high-probability convergence guarantees under standard L-smoothness and σ-sub-Gaussian gradient noise assumptions, recovering state-of-the-art rates without adversaries while improving utility under Byzantine and DP settings, with validation via CNN/MLP experiments on MNIST.

Significance. If the analysis holds, the work would be significant for relaxing the bounded-gradient assumptions prevalent in prior Byzantine-robust and differentially private federated learning literature, providing a unified framework with explicit high-probability bounds and empirical support on standard benchmarks.

major comments (1)
  1. [Abstract / Theoretical Analysis] The high-probability convergence claims rest on the assumption that gradient noise remains σ-sub-Gaussian after clipping and double-momentum robust aggregation, even under Byzantine clients and heterogeneous data. Clipping truncates tails, and Byzantine manipulation can alter the effective distribution around differing per-client means, so the invariance is not immediate from the stated L-smoothness and sub-Gaussian assumptions alone; explicit justification or a counterexample check is required for the central bounds to apply.
minor comments (2)
  1. [Experiments] Report error bars across multiple runs and specify how the clipping threshold (a free parameter) is selected or tuned, to strengthen the reproducibility of the MNIST results.
  2. [Algorithm Description] Notation: Clarify the exact form of the double-momentum update and the robust aggregator (e.g., coordinate-wise median or trimmed mean) with equation numbers for direct comparison to prior work.
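The two aggregators the referee names can be written compactly. Whether the paper uses either of these exact rules is an assumption here; the sketch only pins down what "robust aggregation" means operationally.

```python
import numpy as np

def coordinate_median(updates):
    """Coordinate-wise median of client update vectors
    (robust to fewer than half the clients being Byzantine)."""
    return np.median(np.stack(updates), axis=0)

def trimmed_mean(updates, k):
    """Coordinate-wise mean after discarding the k largest and k smallest
    values in each coordinate; tolerates up to k Byzantine clients."""
    assert 2 * k < len(updates)
    s = np.sort(np.stack(updates), axis=0)
    return s[k:len(updates) - k].mean(axis=0)

# Four honest clients and one Byzantine client sending a huge vector:
honest = [np.array([1.0, -1.0])] * 4
byzantine = [np.array([1e6, -1e6])]
print(trimmed_mean(honest + byzantine, k=1))   # → [ 1. -1.]
```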

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / Theoretical Analysis] The high-probability convergence claims rest on the assumption that gradient noise remains σ-sub-Gaussian after clipping and double-momentum robust aggregation, even under Byzantine clients and heterogeneous data. Clipping truncates tails, and Byzantine manipulation can alter the effective distribution around differing per-client means, so the invariance is not immediate from the stated L-smoothness and sub-Gaussian assumptions alone; explicit justification or a counterexample check is required for the central bounds to apply.

    Authors: We agree that the preservation of sub-Gaussianity after clipping and robust aggregation is not immediate and merits explicit treatment. Clipping at a threshold proportional to the noise scale preserves sub-Gaussian tails up to a constant factor (by standard concentration arguments for truncated sub-Gaussian random variables), while the double-momentum update and robust aggregator (e.g., coordinate-wise median or trimmed mean) yield an effective noise term whose deviation can be bounded via vector concentration inequalities that continue to hold under per-client heterogeneity. We will insert a new supporting lemma (and its proof) in the appendix that derives the adjusted sub-Gaussian parameter for the aggregated gradient under Byzantine corruption, thereby making the high-probability bounds fully rigorous. This addition does not change the main theorem statements or rates. revision: yes
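The first half of the promised lemma can be made precise with an elementary containment: since clipping can only shrink norms pointwise, tail bounds transfer directly. A hedged sketch, using the common 2·exp(−t²/2σ²) tail convention (the paper's constants may differ):

```latex
% Tail transfer under clipping: a pointwise-domination argument.
\|\mathrm{clip}_\tau(X)\| \le \|X\| \;\text{a.s.}
\;\Longrightarrow\;
\Pr\big(\|\mathrm{clip}_\tau(X)\| \ge t\big)
  \le \Pr\big(\|X\| \ge t\big)
  \le 2\exp\!\left(-\tfrac{t^2}{2\sigma^2}\right).
% What this does NOT cover is the bias term
% \mathbb{E}[\mathrm{clip}_\tau(X)] - \mathbb{E}[X],
% which the appendix lemma must bound separately for non-centered,
% heterogeneous per-client gradients.
```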

Circularity Check

0 steps flagged

No circularity: proof rests on independent standard assumptions

full rationale

The paper states a convergence analysis under L-smoothness and σ-sub-Gaussian gradient noise. No derivation step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation chain. The central claim is a high-probability bound derived from these external assumptions rather than from the algorithm's outputs or prior author results. The skeptic concern about post-clipping noise distribution is a question of assumption validity, not circularity in the derivation itself.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard optimization assumptions plus the effectiveness of the new clipping and momentum mechanisms; no new entities are postulated.

free parameters (2)
  • clipping threshold
    Carefully designed clipping parameter that balances robustness, privacy, and convergence; value is chosen to satisfy the stated utility bounds.
  • momentum coefficients
    Double-momentum parameters that must be set to achieve the claimed acceleration and stability.
axioms (2)
  • domain assumption The objective function is L-smooth
    Invoked throughout the convergence analysis to control gradient differences.
  • domain assumption Gradient noise is σ-sub-Gaussian
    Weaker noise assumption replacing the bounded-gradient condition used in prior Byzantine-robust work.
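In one common formalization (the paper's exact statements and constants may differ), the two axioms read:

```latex
% L-smoothness of the objective:
\|\nabla f(x) - \nabla f(y)\| \;\le\; L\,\|x - y\| \qquad \forall\, x, y.

% \sigma-sub-Gaussian gradient noise (one standard MGF-style formulation):
\mathbb{E}_{\xi}\!\left[
  \exp\!\left( \frac{\|\nabla f(x,\xi) - \nabla f(x)\|^{2}}{\sigma^{2}} \right)
\right] \;\le\; \exp(1) \qquad \forall\, x.
```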

pith-pipeline@v0.9.0 · 5501 in / 1485 out tokens · 47198 ms · 2026-05-14T23:59:36.843344+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages
