Federated Transfer Learning with Differential Privacy

Mengchu Li; Yang Feng; Ye Tian; Yi Yu

arxiv: 2403.11343 · v4 · submitted 2024-03-17 · 💻 cs.LG · cs.CR· math.ST· stat.ME· stat.ML· stat.TH

Federated Transfer Learning with Differential Privacy

Mengchu Li , Ye Tian , Yang Feng , Yi Yu This is my paper

Pith reviewed 2026-05-24 03:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CRmath.STstat.MEstat.MLstat.TH

keywords privacydatafederatedlearningdifferentialtransfercentralchallenges

0 comments

The pith

Introduces federated differential privacy as an intermediate model between local and central DP and analyzes minimax rates for four statistical tasks under heterogeneity and privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work tackles two issues in federated learning: data that differs across locations and the need to keep each location's data private. It defines federated differential privacy, a privacy standard that protects every dataset without requiring a trusted central party to see the raw data. The authors then examine four estimation tasks: finding a simple average, fitting low-dimensional and high-dimensional linear models, and general M-estimation. They calculate the lowest possible error rates under this privacy rule and show that the new privacy model sits between the stricter local privacy model and the more permissive central privacy model. The analysis includes how differences in data distributions affect accuracy and how sharing knowledge from source datasets can help the target dataset.

Core claim

we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy.

Load-bearing premise

The paper assumes that minimax rates for the four listed statistical problems can be derived while simultaneously incorporating both data heterogeneity across sites and the federated differential privacy constraint without a trusted server (abstract).

Figures

Figures reproduced from arXiv: 2403.11343 by Mengchu Li, Yang Feng, Ye Tian, Yi Yu.

**Figure 2.** Figure 2: Comparison of estimation errors under different DP notions, when the sample [PITH_FULL_IMAGE:figures/full_fig_p030_2.png] view at source ↗

**Figure 3.** Figure 3: Performance of different methods under varying degrees of heterogeneity between [PITH_FULL_IMAGE:figures/full_fig_p031_3.png] view at source ↗

**Figure 4.** Figure 4: An illustration of the informative source detection strategy. The blue dash-line [PITH_FULL_IMAGE:figures/full_fig_p038_4.png] view at source ↗

read the original abstract

Federated learning has emerged as a powerful framework for analysing distributed data, yet two challenges remain pivotal: heterogeneity across sites and privacy of local data. In this paper, we address both challenges within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of federated differential privacy, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy model, we study four statistical problems: univariate mean estimation, low-dimensional linear regression, high-dimensional linear regression, and M-estimation. By investigating the minimax rates and quantifying the cost of privacy, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses account for data heterogeneity and privacy, highlighting the fundamental costs associated with each factor and the benefits of knowledge transfer in federated learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines federated differential privacy as an intermediate model without a trusted server and derives minimax rates for mean estimation, regressions, and M-estimation in heterogeneous transfer learning, but the derivations need checking to confirm the strict positioning between local and central DP.

read the letter

This paper defines federated differential privacy as an intermediate model without a trusted server and derives minimax rates for mean estimation, regressions, and M-estimation in heterogeneous transfer learning, but the derivations need checking to confirm the strict positioning between local and central DP after accounting for site shifts. It formulates the privacy notion explicitly for federated settings and applies it to transfer learning across sources and target, with analyses that try to separate the costs of privacy and heterogeneity. The abstract indicates they quantify how knowledge transfer helps under these constraints. That framing is the main new piece. The work does a clean job listing the four statistical problems and stating the goal of showing an intermediate privacy level with concrete rates. The citation pattern looks standard for DP and federated learning papers. The soft spot is verification. The abstract asserts the rates place federated DP strictly between local and central while handling heterogeneity, but without the actual proofs or the precise mechanism for noise aggregation and distribution shifts, it is not possible to confirm the intermediate claim holds. The stress-test point lands here: if the modeling of heterogeneity turns out loose or the privacy composition does not deliver the expected separation, the central positioning weakens. No circularity shows up in the stated claims. This paper is for theorists working on privacy-utility tradeoffs in distributed statistical estimation. A reader who wants benchmark rates for regulated federated settings would get value once the math is confirmed. It deserves a serious referee to check the derivations and the heterogeneity handling. I would send it to peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript formulates federated differential privacy (no trusted server) in a transfer-learning setting with heterogeneous source and target datasets. It derives minimax rates for univariate mean estimation, low-dimensional linear regression, high-dimensional linear regression, and M-estimation under this privacy model, claims these rates lie strictly between the local-DP and central-DP rates, and quantifies the separate costs of privacy and heterogeneity together with the benefit of knowledge transfer.

Significance. If the minimax derivations are correct and the intermediate positioning holds after accounting for heterogeneity, the work would supply a new, realistic privacy model for federated settings and explicit rate characterizations that separate privacy cost from heterogeneity cost across four canonical problems. The explicit treatment of transfer across heterogeneous sites is a strength.

major comments (3)

[Abstract, §3] Abstract and §3: the central claim that federated DP rates are strictly between local and central DP requires explicit side-by-side statements of the three rates (local, federated, central) for each of the four problems; without these comparisons the intermediate positioning cannot be verified from the stated results.
[§4] §4 (heterogeneity modeling): the minimax formulation must incorporate a concrete heterogeneity parameter (e.g., bounded mean shift or total-variation distance between source and target distributions) that appears in the rate expressions; the current treatment leaves the precise interaction between heterogeneity and the federated privacy constraint unspecified, which is load-bearing for separating the two costs.
[§5–§8] §5–§8 (proofs of minimax rates): the abstract asserts rigorous derivations, yet the provided text does not contain the full proofs or the precise local-randomization mechanism that aggregates noise without a server; verification that the rates are indeed strictly better than local DP while respecting the no-trusted-server constraint is therefore impossible.

minor comments (2)

Notation for the privacy parameters (ε, δ) and the heterogeneity radius should be introduced once and used consistently across all four problem sections.
Figure captions should state the precise values of n, m, d, K, and the heterogeneity parameter used in each plotted curve.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and verifiability.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3: the central claim that federated DP rates are strictly between local and central DP requires explicit side-by-side statements of the three rates (local, federated, central) for each of the four problems; without these comparisons the intermediate positioning cannot be verified from the stated results.

Authors: We agree that explicit comparisons are needed for verification. In the revision we will add a table (or explicit statements) in the abstract and Section 3 listing the minimax rates under local DP, federated DP, and central DP for univariate mean estimation, low-dimensional linear regression, high-dimensional linear regression, and M-estimation, confirming the strict intermediate positioning. revision: yes
Referee: [§4] §4 (heterogeneity modeling): the minimax formulation must incorporate a concrete heterogeneity parameter (e.g., bounded mean shift or total-variation distance between source and target distributions) that appears in the rate expressions; the current treatment leaves the precise interaction between heterogeneity and the federated privacy constraint unspecified, which is load-bearing for separating the two costs.

Authors: We will revise Section 4 to introduce an explicit heterogeneity parameter (e.g., a bound on mean shift or total-variation distance between source and target distributions) that appears directly in the minimax rate expressions. This will make the interaction with the federated privacy constraint precise and allow clear separation of privacy and heterogeneity costs. revision: yes
Referee: [§5–§8] §5–§8 (proofs of minimax rates): the abstract asserts rigorous derivations, yet the provided text does not contain the full proofs or the precise local-randomization mechanism that aggregates noise without a server; verification that the rates are indeed strictly better than local DP while respecting the no-trusted-server constraint is therefore impossible.

Authors: The full proofs and the local-randomization mechanism (each site adds noise locally; aggregation occurs without a trusted server) are contained in the appendix of the arXiv version. We will add prominent references to the appendix in the main text (Sections 3 and 5–8) and include a concise description of the mechanism in Section 3. The derived rates are strictly better than local DP because transfer learning permits controlled information sharing under the federated constraint. revision: yes

Circularity Check

0 steps flagged

Minimax derivations under federated DP are independent of inputs

full rationale

The paper defines federated differential privacy as a new privacy model without a trusted server, then derives minimax rates for four statistical problems (univariate mean estimation, low-dimensional and high-dimensional linear regression, M-estimation) while incorporating heterogeneity. These steps rely on standard information-theoretic lower bounds and upper-bound constructions that quantify privacy cost separately from heterogeneity; no equation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled in. The intermediate positioning between local and central DP follows directly from comparing the derived rates rather than from re-labeling inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review yields limited visibility into parameters or axioms; the central claim rests on the definitional introduction of federated DP and the feasibility of rate derivations under heterogeneity.

axioms (1)

domain assumption Data heterogeneity across sites can be modeled while preserving privacy guarantees without a trusted server
Invoked when formulating the federated DP model and studying transfer benefits (abstract).

invented entities (1)

federated differential privacy no independent evidence
purpose: Privacy model that protects each dataset in federated transfer learning without a trusted central server
New notion introduced to sit between local and central DP; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5702 in / 1176 out tokens · 36258 ms · 2026-05-24T03:27:28.199957+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

General Lower Bounds for Differentially Private Federated Learning with Arbitrary Public-Transcript Interactions
cs.LG 2026-05 unverdicted novelty 8.0

Derives a federated van Trees lower bound under total clientwise sample-level zCDP for parameter estimation with squared l2 loss in federated learning protocols with arbitrary public-transcript interactions.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper

[1]

Algorithm 4 is (ϵ, δ)-central DP

work page
[2]

Suppose that n ≳ dT log n∨(T /δ) η log(T /[η(ϵ ∧ δ)]) ϵ , (35) and T = ⌈C log(n)⌉ for some absolute constant C > 0

Initialise Algorithm 4 with β0 = 0 and step size ρ = 18L(1 + 81L2)−1. Suppose that n ≳ dT log n∨(T /δ) η log(T /[η(ϵ ∧ δ)]) ϵ , (35) and T = ⌈C log(n)⌉ for some absolute constant C > 0. We then have with probability at least 1 − 7η that ∥βT − β∗∥2 ≲ ∥β∗∥2 n C 81L2+1 + log log(n) η r d log(n) n + d log2(n/η) p log(log(n)/η) log(1/δ) nϵ

work page
[3]

Lemma 9 shows that Algorithm 4 achieves the optimal convergence rate up to poly- logarithmic factors

In addition, suppose that ∥β∗∥2 ≤ C′ for some absolute constant C′ and C ≥ (81L2+1)/2, then we have ∥βT −β∗∥2 ≲ r(n, d, ϵ, δ, η) = log log(n) η r d log(n) n + d log2(n/η) p log(1/δ) log(log(n)/η)) nϵ with probability at least 1 − 7η. Lemma 9 shows that Algorithm 4 achieves the optimal convergence rate up to poly- logarithmic factors. Compared to Cai et al...

work page 2019
[4]

Let N =P k∈{0}∪A nk. For η ∈ (0, 1), under the conditions that min k∈{0}∪A nk ≳ T dlog(T /η) ∨ T log(T /(δη)) log(T /(η(ϵ ∧ δ))ϵ−1, R ≳ p d log(N/η) and Rt ≳ p log(N/η)PrivateVariance({X ⊤ τt+iβt − Yτt+i}b i=1, ϵ′, δ′), we have that P(E ′ 1 ∩ E ′ 2 ∩ E ′ 3 ∩ E ′

work page
[5]

Proof of Corollary 12

≥ 1 − 6η. Proof of Corollary 12. The proof is a generalisation of the single site result in Lemma 11 to multi-site. For brevity, we only point out the differences between controlling {Ei}i∈[4] and {E ′ i}i∈[4]. For E ′ 1, we note that the population version of X k∈{0}∪A nk b(k)N b(k) X i=1 X(k) τt+iX(k)⊤ τt+i is ˜Σ = P k∈{0}∪A nkΣ(k)/N, which has λmin(˜Σ)...

work page
[6]

For E ′ 2, the same arguments for controlling E2 in Lemma 11 still work, but with n by N in the choice of R to account for the union bound over N random variables

≥ 1 − η, as long as N ≳ T dlog(T /η). For E ′ 2, the same arguments for controlling E2 in Lemma 11 still work, but with n by N in the choice of R to account for the union bound over N random variables. The same arguments for E3 also works for E ′ 3 but with Σ replaced by Σ (k) where appro- priate, and notice that ∥βt − β∥Σ(k) ≲ ∥βt − β∥2 for any k. 56 For...

work page
[7]

Definition 2

≥ 1 − 2η. Definition 2. Given a data set D, we say a randomised algorithm M is (ϵ, δ)-central DP with respect to a set S ⊆ D, if P(M(D) ∈ O|D) ≤ eϵP(M(D′) ∈ O|D′) for any measurable set O and any data set D′ that can be obtained by altering at most one data entry in S. We use MS ϵ,δ to denote the set of all procedures that are (ϵ, δ)-central DP with respe...

work page 2019
[8]

9s′ 10s − 1 ξ2 + 22 5 L2ξ − 2 11 9 2 L4 # ≤ 1 − s s′ 5 11L2

(47) Writing fβ(y, x) as the joint density, we have fβ(y, x) = 1√ 2πσ m+n exp − Pn i=1(yi − x⊤ i β)2 +Pn+m i=n+1(yi − x⊤ i β′)2 2σ2 m+nY i=1 ϕ(xi), where ϕ(xi) is the density of N (0, I). Note that since β′ is not a function of β, we have ∂fβ(y, x) ∂β = fβ(y, x) σ2 nX i=1 (yi − x⊤ i β)xi, and therefore we have X i∈[n] EAi = X j∈[d] E {M(Y , X)}j X i∈[n] (...

work page 2019
[9]

= s s log(d/η) log(n0) n0 + s log1/2(1/δ) log5/2(n0d/η) n0ϵ ,

work page
[10]

= s s log(d/η) log(nA + n0) nA + n0 + h + p |A|ds log1/2(1/δ) log5/2[((nA + n0)d)/η] (nA + n0)ϵ . Case 1: When √ | ˆA|ds′ log1/2(1/δ) log5/2[((n ˆA+n0)d)/η] (n ˆA+n0)ϵ ≤ C0rHLR(n0, s′, d, ϵ, δ, η) ≲ rHLR(n0, s, d, ϵ, δ, η) and h ≤ crHLR(n0, s, d, ϵ, δ, η), where c is the constant in Proposition 18.(iii): We have

work page
[11]

≲ [1], ˆA = A with probability at least 1 − η by Proposition 18.(iii), and the bound [2] follows from Proposition 19.(ii). Case 2: When q | ˆA|ds′ log1/2(1/δ) log5/2[((n ˆA + n0)d)/η] (n ˆA + n0)ϵ ≤ C0rHLR(n0, s′, d, ϵ, δ, η) and h > cr HLR(n0, s, d, ϵ, δ, η), where c is the constant in Proposition 18.(iii): [2] ≳ [1]. By Proposition 18.(ii), we know ∥β(k...

work page 2019
[12]

1 2 + Cγ s s′ log(d/η) n/T + c # ∥βt − β∗∥2 Σ + C′ s′ log(d/η) n/T . (70) Similarly, Lt n(βt+1) − Lt n(β∗) ≥

Going back to (58), we have Lt n(βt+1) − Lt n(βt) ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t∥2 2 + (1 − ξ)⟨βt+1 − βt, gt⟩ 71 ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t\(St∪S)∥2 2 − ξ2 2γ ∥gt St∪S∥2 2 − 9ξ 20γ (1 − ξ)∥gt St+1∪St∥2 2 + Cs′∥wt∥2 ∞. Consider a set S′ ⊆ St\St+1 with |S′| = |I t\(St ∪ S)| = |St+1\(St ∪ S)|. A...

work page 2019
[13]

1 2 + Cγ s s′ log(d/η) N/T + c # ∥βt − β(0)∥2 Σ + s′ log(d/η) N/T + h2. (86) Similarly, Lt N(βt+1) − Lt N(β(0)) ≥

Going back to (75), we have Lt N(βt+1) − Lt N(βt) ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t∥2 2 + (1 − ξ)⟨βt+1 − βt, gt⟩ ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t\(St∪S)∥2 2 − ξ2 2γ ∥gt St∪S∥2 2 − 9ξ 20γ (1 − ξ)∥gt St+1∪St∥2 2 + Cs′∥wt∥2 ∞. Consider a set S′ ⊆ St\St+1 with |S′| = |I t\(St ∪ S)| = |St+1\(St ∪ S)|. Appl...

work page 2019

[1] [1]

Algorithm 4 is (ϵ, δ)-central DP

work page

[2] [2]

Suppose that n ≳ dT log n∨(T /δ) η log(T /[η(ϵ ∧ δ)]) ϵ , (35) and T = ⌈C log(n)⌉ for some absolute constant C > 0

Initialise Algorithm 4 with β0 = 0 and step size ρ = 18L(1 + 81L2)−1. Suppose that n ≳ dT log n∨(T /δ) η log(T /[η(ϵ ∧ δ)]) ϵ , (35) and T = ⌈C log(n)⌉ for some absolute constant C > 0. We then have with probability at least 1 − 7η that ∥βT − β∗∥2 ≲ ∥β∗∥2 n C 81L2+1 + log log(n) η r d log(n) n + d log2(n/η) p log(log(n)/η) log(1/δ) nϵ

work page

[3] [3]

Lemma 9 shows that Algorithm 4 achieves the optimal convergence rate up to poly- logarithmic factors

In addition, suppose that ∥β∗∥2 ≤ C′ for some absolute constant C′ and C ≥ (81L2+1)/2, then we have ∥βT −β∗∥2 ≲ r(n, d, ϵ, δ, η) = log log(n) η r d log(n) n + d log2(n/η) p log(1/δ) log(log(n)/η)) nϵ with probability at least 1 − 7η. Lemma 9 shows that Algorithm 4 achieves the optimal convergence rate up to poly- logarithmic factors. Compared to Cai et al...

work page 2019

[4] [4]

Let N =P k∈{0}∪A nk. For η ∈ (0, 1), under the conditions that min k∈{0}∪A nk ≳ T dlog(T /η) ∨ T log(T /(δη)) log(T /(η(ϵ ∧ δ))ϵ−1, R ≳ p d log(N/η) and Rt ≳ p log(N/η)PrivateVariance({X ⊤ τt+iβt − Yτt+i}b i=1, ϵ′, δ′), we have that P(E ′ 1 ∩ E ′ 2 ∩ E ′ 3 ∩ E ′

work page

[5] [5]

Proof of Corollary 12

≥ 1 − 6η. Proof of Corollary 12. The proof is a generalisation of the single site result in Lemma 11 to multi-site. For brevity, we only point out the differences between controlling {Ei}i∈[4] and {E ′ i}i∈[4]. For E ′ 1, we note that the population version of X k∈{0}∪A nk b(k)N b(k) X i=1 X(k) τt+iX(k)⊤ τt+i is ˜Σ = P k∈{0}∪A nkΣ(k)/N, which has λmin(˜Σ)...

work page

[6] [6]

For E ′ 2, the same arguments for controlling E2 in Lemma 11 still work, but with n by N in the choice of R to account for the union bound over N random variables

≥ 1 − η, as long as N ≳ T dlog(T /η). For E ′ 2, the same arguments for controlling E2 in Lemma 11 still work, but with n by N in the choice of R to account for the union bound over N random variables. The same arguments for E3 also works for E ′ 3 but with Σ replaced by Σ (k) where appro- priate, and notice that ∥βt − β∥Σ(k) ≲ ∥βt − β∥2 for any k. 56 For...

work page

[7] [7]

Definition 2

≥ 1 − 2η. Definition 2. Given a data set D, we say a randomised algorithm M is (ϵ, δ)-central DP with respect to a set S ⊆ D, if P(M(D) ∈ O|D) ≤ eϵP(M(D′) ∈ O|D′) for any measurable set O and any data set D′ that can be obtained by altering at most one data entry in S. We use MS ϵ,δ to denote the set of all procedures that are (ϵ, δ)-central DP with respe...

work page 2019

[8] [8]

9s′ 10s − 1 ξ2 + 22 5 L2ξ − 2 11 9 2 L4 # ≤ 1 − s s′ 5 11L2

(47) Writing fβ(y, x) as the joint density, we have fβ(y, x) = 1√ 2πσ m+n exp − Pn i=1(yi − x⊤ i β)2 +Pn+m i=n+1(yi − x⊤ i β′)2 2σ2 m+nY i=1 ϕ(xi), where ϕ(xi) is the density of N (0, I). Note that since β′ is not a function of β, we have ∂fβ(y, x) ∂β = fβ(y, x) σ2 nX i=1 (yi − x⊤ i β)xi, and therefore we have X i∈[n] EAi = X j∈[d] E {M(Y , X)}j X i∈[n] (...

work page 2019

[9] [9]

= s s log(d/η) log(n0) n0 + s log1/2(1/δ) log5/2(n0d/η) n0ϵ ,

work page

[10] [10]

= s s log(d/η) log(nA + n0) nA + n0 + h + p |A|ds log1/2(1/δ) log5/2[((nA + n0)d)/η] (nA + n0)ϵ . Case 1: When √ | ˆA|ds′ log1/2(1/δ) log5/2[((n ˆA+n0)d)/η] (n ˆA+n0)ϵ ≤ C0rHLR(n0, s′, d, ϵ, δ, η) ≲ rHLR(n0, s, d, ϵ, δ, η) and h ≤ crHLR(n0, s, d, ϵ, δ, η), where c is the constant in Proposition 18.(iii): We have

work page

[11] [11]

≲ [1], ˆA = A with probability at least 1 − η by Proposition 18.(iii), and the bound [2] follows from Proposition 19.(ii). Case 2: When q | ˆA|ds′ log1/2(1/δ) log5/2[((n ˆA + n0)d)/η] (n ˆA + n0)ϵ ≤ C0rHLR(n0, s′, d, ϵ, δ, η) and h > cr HLR(n0, s, d, ϵ, δ, η), where c is the constant in Proposition 18.(iii): [2] ≳ [1]. By Proposition 18.(ii), we know ∥β(k...

work page 2019

[12] [12]

1 2 + Cγ s s′ log(d/η) n/T + c # ∥βt − β∗∥2 Σ + C′ s′ log(d/η) n/T . (70) Similarly, Lt n(βt+1) − Lt n(β∗) ≥

Going back to (58), we have Lt n(βt+1) − Lt n(βt) ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t∥2 2 + (1 − ξ)⟨βt+1 − βt, gt⟩ 71 ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t\(St∪S)∥2 2 − ξ2 2γ ∥gt St∪S∥2 2 − 9ξ 20γ (1 − ξ)∥gt St+1∪St∥2 2 + Cs′∥wt∥2 ∞. Consider a set S′ ⊆ St\St+1 with |S′| = |I t\(St ∪ S)| = |St+1\(St ∪ S)|. A...

work page 2019

[13] [13]

1 2 + Cγ s s′ log(d/η) N/T + c # ∥βt − β(0)∥2 Σ + s′ log(d/η) N/T + h2. (86) Similarly, Lt N(βt+1) − Lt N(β(0)) ≥

Going back to (75), we have Lt N(βt+1) − Lt N(βt) ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t∥2 2 + (1 − ξ)⟨βt+1 − βt, gt⟩ ≤ 1 2 γ∥βt+1 I t − βt I t + ξ/γ · gt I t∥2 2 − ξ2 2γ ∥gt I t\(St∪S)∥2 2 − ξ2 2γ ∥gt St∪S∥2 2 − 9ξ 20γ (1 − ξ)∥gt St+1∪St∥2 2 + Cs′∥wt∥2 ∞. Consider a set S′ ⊆ St\St+1 with |S′| = |I t\(St ∪ S)| = |St+1\(St ∪ S)|. Appl...

work page 2019