Optimal Dimension-Free Sampling for Regularized Classification

Alexander Munteanu; Jeff M. Phillips; Meysam Alishahi; Simon Omlor

arxiv: 2605.23726 · v1 · pith:R3JIHOMLnew · submitted 2026-05-22 · 💻 cs.LG · cs.DS· stat.ML

Optimal Dimension-Free Sampling for Regularized Classification

Meysam Alishahi , Alexander Munteanu , Simon Omlor , Jeff M. Phillips This is my paper

Pith reviewed 2026-05-25 04:59 UTC · model grok-4.3

classification 💻 cs.LG cs.DSstat.ML

keywords dimension-free samplingregularized classificationlogistic losssampling complexityempirical risk minimizationlipschitz continuous losssensitivity sampling

0 comments

The pith

A bounded derivative condition on classification losses yields linear-in-k sampling for squared l2 regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes sampling schemes that approximate the regularized classification risk to within relative error 1±ε using a number of samples depending only on k and ε. For l2-over-k regularization the size is k²/ε² and for l1-over-k it is k/ε². When the loss g meets |g'(x)| ≤ g(x), g(0) > 0, and is monotonic or convex, squared l2-over-k regularization drops to linear in k; the same condition is necessary, since g(0) = 0 rules out any dimension-free bound. The results apply to logistic, hinge, sigmoid, and ReLU losses and are proved via uniform or norm sampling together with higher-moment and empirical-process arguments that improve on prior cubic bounds. Matching lower bounds hold up to polylog factors.

Core claim

We prove optimal sampling bounds achieving (1±ε)-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. In particular, we prove k²/ε² upper and lower bounds for ||·||₂/k regularization, and k/ε² upper and lower bounds for ||·||₁/k regularization. For ||·||₂²/k regularization, the sampling complexity depends mainly on a bounded derivative property: if |g'(x)|≤g(x), and g(0)>0, and g is monotonic or convex, then it admits linear in k sampling complexity; otherwise the general bound is k²/ε². However, if g(0)=0, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out.

What carries the argument

uniform and squared-norm sampling together with higher-moment bounds and empirical process analysis

If this is right

Logistic, sigmoid, hinge, and ReLU losses all fall under the stated linear or quadratic bounds.
The number of samples is independent of ambient dimension and number of training points.
Simple uniform or norm sampling replaces more expensive sensitivity sampling.
All upper bounds are tight up to polylogarithmic factors via explicit lower-bound constructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same derivative condition may produce improved sampling rates in regularized regression or other convex losses.
When g(0) = 0, importance weighting or adaptive sampling could be tested to recover sub-quadratic rates.
The higher-moment analysis that avoids VC-dimension overcounting might apply to approximation guarantees in other empirical-risk settings.

Load-bearing premise

The loss g must satisfy |g'(x)| ≤ g(x) with g(0) > 0 to reach the linear-in-k regime.

What would settle it

A concrete loss function with g(0) = 0 for which dimension-free sampling still produces (1±ε) relative error on some regularized problem would falsify the impossibility claim.

read the original abstract

We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss, as prominent and popular representative examples. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)|\leq g(x)$, and $g(0)>0$, and $g$ is monotonic or convex, then it admits linear in $k$ sampling complexity; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0)=0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and hereby improves over recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of (Alishahi and Phillips, ICML'24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid overcounting that appears in the de-facto standard VC-dimension and sensitivity framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper pins down matching dimension-free sampling rates for regularized classification losses and improves the prior cubic bound using higher-moment arguments.

read the letter

The main thing to know is that the authors derive optimal sampling complexities that are dimension-free for a family of Lipschitz classification losses under different regularizers, with matching upper and lower bounds up to polylog factors. They get k²/ε² samples for ||·||₂/k regularization, k/ε² for ||·||₁/k, and linear in k for ||·||₂²/k when the loss g satisfies |g'(x)| ≤ g(x), g(0) > 0, and monotonicity or convexity. This covers logistic, hinge, and ReLU losses, and they show that g(0) = 0 rules out any dimension-free sampling. The technical step forward is replacing sensitivity sampling with uniform or norm sampling plus higher-moment and empirical-process bounds that avoid overcounting, which drops the dependence from the earlier k³/ε² result in their ICML'24 paper. The cases are laid out cleanly and the negative result when g(0)=0 is a useful clarification. The work is self-contained on the cited prior result and does not rely on fitted parameters. The soft spots are limited. The linear-in-k regime requires the derivative condition, which holds for the listed losses but is an extra premise; without it the bound reverts to k²/ε². The abstract describes the empirical-process steps as refined, but the precise handling of constants and edge cases in the uniform convergence arguments would need verification in the full proofs. No internal contradictions appear in the stated regimes. This paper is for researchers who need tight sample sizes when training regularized classifiers on large data. A practitioner or theorist working on sampling-based optimization would get usable rates and clear conditions to apply. It deserves a serious referee because the rates are concrete, the improvement over the prior cubic bound is explicit, and the impossibility result adds value. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper claims to prove optimal dimension-free sampling bounds achieving (1±ε)-relative error for a broad class of Lipschitz continuous classification loss functions (including logistic, sigmoid, hinge, and ReLU) under various regularization terms. It establishes k²/ε² upper and lower bounds for ||·||₂/k regularization, k/ε² bounds for ||·||₁/k regularization, and for ||·||₂²/k regularization a linear-in-k bound when |g'(x)|≤g(x), g(0)>0 and g monotonic or convex (otherwise k²/ε²); if g(0)=0 then no dimension-free bounds exist. All upper bounds are matched by lower bounds up to polylog factors, achieved via simple uniform or (squared) norm sampling together with higher-moment and empirical-process arguments that improve on prior k³/ε² sensitivity sampling.

Significance. If the derivations hold, the results deliver tight, dimension-free sampling guarantees for core regularized classification tasks and replace cubic sensitivity sampling with simpler uniform/norm sampling plus refined moment arguments. The explicit case distinctions, matching lower bounds, and identification of the |g'(x)|≤g(x) condition as the threshold for linear-in-k complexity constitute a clear technical advance with potential practical value for scalable high-dimensional classification.

major comments (2)

[Abstract / assumptions paragraph] The bounded-derivative condition |g'(x)|≤g(x) together with g(0)>0 and monotonicity/convexity is load-bearing for the linear-in-k regime under ||·||₂²/k regularization (abstract). The manuscript must explicitly verify that the listed losses (logistic, hinge, ReLU) satisfy it and must state the precise sampling complexity when the condition fails but g(0)>0.
[Proof sections for upper bounds] The improvement from k³/ε² to k²/ε² (or linear in k) rests on the higher-moment bounds and empirical-process analysis that avoid overcounting in the VC/sensitivity framework. The central claim therefore requires a self-contained verification that these arguments indeed produce the stated relative-error guarantees without hidden dimension dependence.

minor comments (2)

State the precise polylogarithmic factors appearing in the matching lower bounds.
Clarify whether the uniform or norm sampling procedures are fully algorithmic (including how the sampling distribution is constructed in practice) or remain existential.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address each major comment below.

read point-by-point responses

Referee: [Abstract / assumptions paragraph] The bounded-derivative condition |g'(x)|≤g(x) together with g(0)>0 and monotonicity/convexity is load-bearing for the linear-in-k regime under ||·||₂²/k regularization (abstract). The manuscript must explicitly verify that the listed losses (logistic, hinge, ReLU) satisfy it and must state the precise sampling complexity when the condition fails but g(0)>0.

Authors: We agree that explicit verification strengthens the presentation. In the revised manuscript we will add a short paragraph (or appendix entry) confirming that logistic loss, hinge loss, and ReLU loss each satisfy |g'(x)| ≤ g(x), g(0) > 0, and the monotonicity/convexity requirement. When the bounded-derivative condition fails but g(0) > 0 the complexity remains k²/ε², which is already stated in the abstract and main text; we will emphasize this case distinction more clearly. revision: yes
Referee: [Proof sections for upper bounds] The improvement from k³/ε² to k²/ε² (or linear in k) rests on the higher-moment bounds and empirical-process analysis that avoid overcounting in the VC/sensitivity framework. The central claim therefore requires a self-contained verification that these arguments indeed produce the stated relative-error guarantees without hidden dimension dependence.

Authors: The proofs in Sections 3–4 are already self-contained: they apply higher-moment bounds and empirical-process concentration directly to the uniform or squared-norm sampling estimators, without invoking VC-dimension or sensitivity sampling. These steps are dimension-free by design. To make the argument more immediately verifiable we will insert a short self-contained outline (or lemma) that isolates the key inequalities guaranteeing the (1 ± ε) relative error. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's core results derive from new higher-moment bounds and empirical-process arguments applied to uniform or squared-norm sampling, which are independent of the cited prior work. The self-citation to (Alishahi and Phillips, ICML'24) serves only to contrast the previous k³/ε² sensitivity bound; the claimed k²/ε², k/ε², and linear-in-k regimes follow from explicit assumptions (e.g., |g'(x)| ≤ g(x), g(0) > 0, monotonicity/convexity) plus matching lower bounds. No derivation step reduces a claimed bound to a fitted parameter, self-definition, or load-bearing self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard mathematical tools and domain assumptions about the loss functions; no free parameters or invented entities are introduced.

axioms (2)

domain assumption Loss functions are Lipschitz continuous
Explicitly required for the (1±ε) relative-error bounds to hold across the listed losses.
standard math Standard concentration inequalities and empirical process theory apply
Invoked in the refined higher-moment arguments that replace VC-dimension/sensitivity sampling.

pith-pipeline@v0.9.0 · 5819 in / 1512 out tokens · 25757 ms · 2026-05-25T04:59:45.628436+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1]

Thomas Hofmann, Bernhard Sch¨ olkopf, and Alexander J Smola

URLhttp://eudml.org/doc/218383. Thomas Hofmann, Bernhard Sch¨ olkopf, and Alexander J Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3):1171–1220, 2008. Arun Jambulapati, James R. Lee, Yang P. Liu, and Aaron Sidford. Sparsifying sums of norms. In 64th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 1953–1962...

work page doi:10.1214/aos/1176346079 2008
[2]

We note that 1 m Pm i=1 wi ≤2, since eachw i = 2S s(ai)+S ≤ 2S S = 2

dP(a) = s(a)+S 2S dP(a) such that if we reweight the contributions of point ai by wi =w(a i) = 2S s(ai)+S , then it holds with probability at least 1−δthat ∀x∈R d : f0(x)− 1 m mX i=1 wig(⟨ai, x⟩) ≤εf(x). We note that 1 m Pm i=1 wi ≤2, since eachw i = 2S s(ai)+S ≤ 2S S = 2. We can draw from Q via a mixture of uniform and rejection sampling. I.e., we draw a...

work page
[3]

A similar assumption as Assumption [A2] is then implied by Jensen’s inequality since Ea∼P(∥a∥2)2 ≤ Ea∼P(∥a∥2 2)≤B

= B. A similar assumption as Assumption [A2] is then implied by Jensen’s inequality since Ea∼P(∥a∥2)2 ≤ Ea∼P(∥a∥2 2)≤B. ThusE a∼P(∥a∥2)≤ √ B. We note that in any case, we will use the upper bound B≤S . This is valid since we will ensure that s(a) ≥ ∥a∥ 2 + 1, resp. s(a) ≥ ∥a∥ 2 2 + 2. So in the first case we have S = B + 1 ≥B , and in the second case, we ...

work page 1991
[4]

We start wit a few preliminaries on Gaussian processes and their analysis

general functions with∥·∥ 1-regularization. We start wit a few preliminaries on Gaussian processes and their analysis. 26 C.1 Moment bound for Gaussian processes The following moment bound follows from the so-called Dudley’s theorem and is one of our main analysis tools: Lemma C.1(Woodruff and Yasuda, 2023, slightly modified).Let( Xt)t∈T be a Gaussian pro...

work page 2023
[5]

We will use s(a) ≥ ∥a∥2 2 + 2, which is squared norm sampling and can be handled analogously to the previous norm sampling

= B <∞ for some L≥ 1, B≥ 0, which is slightly stronger as Assumption [A2] before as discussed in Section A.2. We will use s(a) ≥ ∥a∥2 2 + 2, which is squared norm sampling and can be handled analogously to the previous norm sampling. Indeed, it implies s(a) ≥ ∥a∥2 + 1 as before and under the strengthened assumption, if we set s(a) = ∥a∥2 2 + 2, we have th...

work page 1991
[6]

We have for both choices thatR(x)≤2 since (a)R(x) =∥x∥ 2 2 = 12 +P(d−1)/2 i=1 (1/ p (d−1)/2) 2 = 2 , (b)R(x) =∥x∥ 2 = p ∥x∥2 2 = √ 2<2

work page
[7]

Also note forj >(d−1)/2 thatv T j x= 1 and thusg(v T j x) = 1−1 = 0, by the assumption it also holds for all samples thata ix= 1 resp.g(a ix) =g(1) = 0

work page
[8]

These observations together with the choice ofd−1 = (k/6ε) 2 yield that   1 d−1 d−1X j=1 g(vjx) + R(x) k   = 1 d−1 d−1 2 1p (d−1) + R(x) k ≤ 6ε 2k + 2 k = 4 + 6ε 2k

Finally, forj≤(d−1)/2 we havev T j x= 1−1/ √ 2·1/ p (d−1)/2 = 1−1/ p (d−1) and thusg(v T j x) = 1−1 + 1/ p (d−1) = 1/ p (d−1). These observations together with the choice ofd−1 = (k/6ε) 2 yield that   1 d−1 d−1X j=1 g(vjx) + R(x) k   = 1 d−1 d−1 2 1p (d−1) + R(x) k ≤ 6ε 2k + 2 k = 4 + 6ε 2k . Further we have that 1 m mX i=1 wig(aix) = 1 m mX i=1 wig(1...

work page
[9]

We have for both choices thatR(x) = 1 since (a)R(x) =∥x∥ 2 2 =Pd i=1(1/ √ d)2 = 1 , (b)R(x) =∥x∥ 2 = p ∥x∥2 2 = 1

work page
[10]

Also note forj > d/2 thatv T j x= 0 and thusg(v T j x) =g(0) = 0, by the assumption it also holds for all samples thata ix= 0 resp.g(a ix) =g(0) = 0

work page
[11]

These observations together with the choice ofd= (k/6ε) 2 yield that  1 d dX j=1 g(vjx) + R(x) k   = 1 d d 2 1√ d + R(x) k ≤ 6ε 2k + 1 k = 2 + 6ε 2k

Finally, forj≤d/2 we havev T j x=−1/ √ dand thusg(v T j x) = 1/ √ d. These observations together with the choice ofd= (k/6ε) 2 yield that  1 d dX j=1 g(vjx) + R(x) k   = 1 d d 2 1√ d + R(x) k ≤ 6ε 2k + 1 k = 2 + 6ε 2k . Further we have that 1 m mX i=1 wig(aix) = 1 m mX i=1 wig(0) = 0. Usingε <1/4 we conclude that 1 m mX i=1 wig(aix)− 1 d dX j=1 g(vjx)...

work page
[12]

For either ReLU/hinge/logistic loss with R(·) = ∥·∥1 regularization, m = Ω( ε−2klogk )is required,

work page
[13]

For either sigmoid loss with R(·) = ∥·∥1, or logistic loss with R(·) = ∥·∥2 2 regularization, m= Ω(ε −2k/logk)is required,

work page
[14]

For sigmoid loss withR(·) =∥·∥ 2 2 regularization,m= Ω(ε −2k/(logk) 3)is required. Proof. 1. For ReLU loss with R(·) = ∥·∥1 regularization, m = Ω(ε−2klogk ) is proven in Lemma D.6. For hinge/logistic with the same regularizer, the bound follows by reduction from the case of ReLU, treated in Lemma D.5. 2. For sigmoid loss with R(·) = ∥·∥1, m = Ω( ε−2k/logk...

work page 1983
[15]

For either the logistic loss gL(t) = ln(1 + e−t)or hinge-loss gH(t) = max{0, 1−t} and regularizer term R(x) ∈ {∥x∥1,∥x∥ 2}, there exists a distribution PonR d such thatanyimportance-sampled approximation that attains ˆf0(x)−f 0(x) ≤εf(x),∀x∈R d, with probability at least1−δmust have size m= Ω klog(k/δ) ε2 . Proof. Note that lim β7→∞ gL(βt) β = lim β7→∞ gH...

work page
[16]

For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x)

Consider the logistic loss g(r) = ln(1+ e−r). For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x). Then there exists a distribution P on Rd such that the following holds. Any i.i.d. impor- tance–weighted approximation ˆfof sizemthat satisfies Pr h sup x∈Rd ˆf(x)−f(x) ≤ε f(x) i ≥1−δ for everyk≥4, accurac...

work page 1983
[17]

For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x)

Consider the sigmoid loss g(r) = 1/(1 +er). For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x). Then there exists a distribution P on Rd such that the following holds. Any i.i.d. impor- tance–weighted approximation ˆfof sizemthat satisfies Pr h sup x∈Rd ˆf(x)−f(x) ≤ε f(x) i ≥1−δ for everyk≥4, accuracy ...

work page 1983
[18]

E nX k=1 rkxk q#1/q ≤C p,q

Then we establish an Ω( NlogN ) lower bound with more sophisticated techniques. We summarize both results in the following theorem. Theorem 3.7.Consider the ReLU function where g(r) = max{0,−r} , and the R(x) = ∥x∥2 2 regularizer. For any n≥d there exist distributions supported on n points in Rd, such that with high probabilitym= Ω(nlogn)samples are requi...

work page 1961

[1] [1]

Thomas Hofmann, Bernhard Sch¨ olkopf, and Alexander J Smola

URLhttp://eudml.org/doc/218383. Thomas Hofmann, Bernhard Sch¨ olkopf, and Alexander J Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3):1171–1220, 2008. Arun Jambulapati, James R. Lee, Yang P. Liu, and Aaron Sidford. Sparsifying sums of norms. In 64th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 1953–1962...

work page doi:10.1214/aos/1176346079 2008

[2] [2]

We note that 1 m Pm i=1 wi ≤2, since eachw i = 2S s(ai)+S ≤ 2S S = 2

dP(a) = s(a)+S 2S dP(a) such that if we reweight the contributions of point ai by wi =w(a i) = 2S s(ai)+S , then it holds with probability at least 1−δthat ∀x∈R d : f0(x)− 1 m mX i=1 wig(⟨ai, x⟩) ≤εf(x). We note that 1 m Pm i=1 wi ≤2, since eachw i = 2S s(ai)+S ≤ 2S S = 2. We can draw from Q via a mixture of uniform and rejection sampling. I.e., we draw a...

work page

[3] [3]

A similar assumption as Assumption [A2] is then implied by Jensen’s inequality since Ea∼P(∥a∥2)2 ≤ Ea∼P(∥a∥2 2)≤B

= B. A similar assumption as Assumption [A2] is then implied by Jensen’s inequality since Ea∼P(∥a∥2)2 ≤ Ea∼P(∥a∥2 2)≤B. ThusE a∼P(∥a∥2)≤ √ B. We note that in any case, we will use the upper bound B≤S . This is valid since we will ensure that s(a) ≥ ∥a∥ 2 + 1, resp. s(a) ≥ ∥a∥ 2 2 + 2. So in the first case we have S = B + 1 ≥B , and in the second case, we ...

work page 1991

[4] [4]

We start wit a few preliminaries on Gaussian processes and their analysis

general functions with∥·∥ 1-regularization. We start wit a few preliminaries on Gaussian processes and their analysis. 26 C.1 Moment bound for Gaussian processes The following moment bound follows from the so-called Dudley’s theorem and is one of our main analysis tools: Lemma C.1(Woodruff and Yasuda, 2023, slightly modified).Let( Xt)t∈T be a Gaussian pro...

work page 2023

[5] [5]

We will use s(a) ≥ ∥a∥2 2 + 2, which is squared norm sampling and can be handled analogously to the previous norm sampling

= B <∞ for some L≥ 1, B≥ 0, which is slightly stronger as Assumption [A2] before as discussed in Section A.2. We will use s(a) ≥ ∥a∥2 2 + 2, which is squared norm sampling and can be handled analogously to the previous norm sampling. Indeed, it implies s(a) ≥ ∥a∥2 + 1 as before and under the strengthened assumption, if we set s(a) = ∥a∥2 2 + 2, we have th...

work page 1991

[6] [6]

We have for both choices thatR(x)≤2 since (a)R(x) =∥x∥ 2 2 = 12 +P(d−1)/2 i=1 (1/ p (d−1)/2) 2 = 2 , (b)R(x) =∥x∥ 2 = p ∥x∥2 2 = √ 2<2

work page

[7] [7]

Also note forj >(d−1)/2 thatv T j x= 1 and thusg(v T j x) = 1−1 = 0, by the assumption it also holds for all samples thata ix= 1 resp.g(a ix) =g(1) = 0

work page

[8] [8]

These observations together with the choice ofd−1 = (k/6ε) 2 yield that   1 d−1 d−1X j=1 g(vjx) + R(x) k   = 1 d−1 d−1 2 1p (d−1) + R(x) k ≤ 6ε 2k + 2 k = 4 + 6ε 2k

Finally, forj≤(d−1)/2 we havev T j x= 1−1/ √ 2·1/ p (d−1)/2 = 1−1/ p (d−1) and thusg(v T j x) = 1−1 + 1/ p (d−1) = 1/ p (d−1). These observations together with the choice ofd−1 = (k/6ε) 2 yield that   1 d−1 d−1X j=1 g(vjx) + R(x) k   = 1 d−1 d−1 2 1p (d−1) + R(x) k ≤ 6ε 2k + 2 k = 4 + 6ε 2k . Further we have that 1 m mX i=1 wig(aix) = 1 m mX i=1 wig(1...

work page

[9] [9]

We have for both choices thatR(x) = 1 since (a)R(x) =∥x∥ 2 2 =Pd i=1(1/ √ d)2 = 1 , (b)R(x) =∥x∥ 2 = p ∥x∥2 2 = 1

work page

[10] [10]

Also note forj > d/2 thatv T j x= 0 and thusg(v T j x) =g(0) = 0, by the assumption it also holds for all samples thata ix= 0 resp.g(a ix) =g(0) = 0

work page

[11] [11]

These observations together with the choice ofd= (k/6ε) 2 yield that  1 d dX j=1 g(vjx) + R(x) k   = 1 d d 2 1√ d + R(x) k ≤ 6ε 2k + 1 k = 2 + 6ε 2k

Finally, forj≤d/2 we havev T j x=−1/ √ dand thusg(v T j x) = 1/ √ d. These observations together with the choice ofd= (k/6ε) 2 yield that  1 d dX j=1 g(vjx) + R(x) k   = 1 d d 2 1√ d + R(x) k ≤ 6ε 2k + 1 k = 2 + 6ε 2k . Further we have that 1 m mX i=1 wig(aix) = 1 m mX i=1 wig(0) = 0. Usingε <1/4 we conclude that 1 m mX i=1 wig(aix)− 1 d dX j=1 g(vjx)...

work page

[12] [12]

For either ReLU/hinge/logistic loss with R(·) = ∥·∥1 regularization, m = Ω( ε−2klogk )is required,

work page

[13] [13]

For either sigmoid loss with R(·) = ∥·∥1, or logistic loss with R(·) = ∥·∥2 2 regularization, m= Ω(ε −2k/logk)is required,

work page

[14] [14]

For sigmoid loss withR(·) =∥·∥ 2 2 regularization,m= Ω(ε −2k/(logk) 3)is required. Proof. 1. For ReLU loss with R(·) = ∥·∥1 regularization, m = Ω(ε−2klogk ) is proven in Lemma D.6. For hinge/logistic with the same regularizer, the bound follows by reduction from the case of ReLU, treated in Lemma D.5. 2. For sigmoid loss with R(·) = ∥·∥1, m = Ω( ε−2k/logk...

work page 1983

[15] [15]

For either the logistic loss gL(t) = ln(1 + e−t)or hinge-loss gH(t) = max{0, 1−t} and regularizer term R(x) ∈ {∥x∥1,∥x∥ 2}, there exists a distribution PonR d such thatanyimportance-sampled approximation that attains ˆf0(x)−f 0(x) ≤εf(x),∀x∈R d, with probability at least1−δmust have size m= Ω klog(k/δ) ε2 . Proof. Note that lim β7→∞ gL(βt) β = lim β7→∞ gH...

work page

[16] [16]

For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x)

Consider the logistic loss g(r) = ln(1+ e−r). For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x). Then there exists a distribution P on Rd such that the following holds. Any i.i.d. impor- tance–weighted approximation ˆfof sizemthat satisfies Pr h sup x∈Rd ˆf(x)−f(x) ≤ε f(x) i ≥1−δ for everyk≥4, accurac...

work page 1983

[17] [17]

For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x)

Consider the sigmoid loss g(r) = 1/(1 +er). For p∈ { 1, 2} andq∈ {1,2}, set Rp q(x) := ∥x∥p q k , f 0(x) :=E a∼P g(⟨a, x⟩) , f(x) :=f 0(x) +R p q(x). Then there exists a distribution P on Rd such that the following holds. Any i.i.d. impor- tance–weighted approximation ˆfof sizemthat satisfies Pr h sup x∈Rd ˆf(x)−f(x) ≤ε f(x) i ≥1−δ for everyk≥4, accuracy ...

work page 1983

[18] [18]

E nX k=1 rkxk q#1/q ≤C p,q

Then we establish an Ω( NlogN ) lower bound with more sophisticated techniques. We summarize both results in the following theorem. Theorem 3.7.Consider the ReLU function where g(r) = max{0,−r} , and the R(x) = ∥x∥2 2 regularizer. For any n≥d there exist distributions supported on n points in Rd, such that with high probabilitym= Ω(nlogn)samples are requi...

work page 1961