Sparse Max-Affine Regression

Haitham Kanj; Kiryung Lee; Seonho Kim

arxiv: 2411.02225 · v3 · submitted 2024-11-04 · 📊 stat.ML · cs.IT· cs.LG· math.IT· math.ST· stat.TH

Sparse Max-Affine Regression

Haitham Kanj , Seonho Kim , Kiryung Lee This is my paper

Pith reviewed 2026-05-23 17:36 UTC · model grok-4.3

classification 📊 stat.ML cs.ITcs.LGmath.ITmath.STstat.TH

keywords sparse regressionmax-affine modelsgradient descentvariable selectionpiecewise linear regressionnon-asymptotic convergencesparse PCA

0 comments

The pith

Sparse gradient descent recovers the parameters of a sparse max-affine model exactly from O(s log(d/s)) noise-free observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sparse Gradient Descent (Sp-GD) to solve variable selection for convex piecewise linear regression expressed as the maximum of k affine functions. It supplies a non-asymptotic local convergence guarantee showing that fixed model order yields an ε-accurate estimate from O(max(ε^{-2} σ_z², 1) s log(d/s)) samples under sub-Gaussian noise obeying anti-concentration. The same bound implies exact recovery from O(s log(d/s)) samples when noise is absent. An initialization procedure based on sparse principal component analysis followed by covering search receives its own non-asymptotic guarantee, and a Real Maslov Dequantization map is introduced that converts sparse generalized polynomials into max-affine form with exponentially decaying error.

Core claim

When the model order and parameters are fixed, Sp-GD provides an ε-accurate estimate given O(max(ε^{-2}σ_z²,1)s log(d/s)) observations under sub-Gaussian noise with anti-concentration, implying exact recovery from O(s log(d/s)) noise-free observations.

What carries the argument

Sparse Gradient Descent (Sp-GD) that iteratively refines sparse weight vectors inside the max-affine representation while performing variable selection.

If this is right

Sp-GD converges locally to the ground-truth parameters at the stated sample rate.
The sparse-PCA initialization returns an ε-accurate starting point after O(ε^{-2} max(σ_z^4, σ_z², 1) s² log⁴(d)) Gaussian samples.
Real Maslov Dequantization maps sparse generalized polynomials to max-affine models with error that decays exponentially in the temperature parameter.
All convergence guarantees extend directly to the bounded-noise regime created by the dequantization map.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The exponential decay rate of the dequantization error suggests the same transformation could approximate other classes of sparse non-smooth functions.
If anti-concentration can be verified for a wider family of covariate distributions, the same Sp-GD analysis would immediately cover additional piecewise-linear regression settings.
Allowing the number of affine pieces k to grow slowly with dimension would be a direct next step that preserves the logarithmic dependence on d.

Load-bearing premise

The covariate distribution must obey sub-Gaussianity together with anti-concentration for the local convergence proof to apply.

What would settle it

Run Sp-GD on noise-free data drawn from a distribution satisfying the stated assumptions and check whether recovery error stays bounded away from zero once the sample count drops below c s log(d/s) for small constant c.

Figures

Figures reproduced from arXiv: 2411.02225 by Haitham Kanj, Kiryung Lee, Seonho Kim.

**Figure 1.** Figure 1: Median of err(θb) for different (n,d) pairs using 50 Monte Carlo iterations for k = 3 and s = 25 with Gaussian (top) and Uniform (bottom) covariate distributions in the noiseless case. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_1.png] view at source ↗

**Figure 2.** Figure 2: Median of err(θb) for different (n,s) pairs using 50 Monte Carlo iterations for k = 3 and d = 200 with Gaussian (left) and Uniform (right) covariate distributions. The red curves are fitted with respect to s log(d/s) at the phase transition boundary for both figures. noise variance σ 2 z when the total number of covariates, the number of active covariates, and the model order are fixed to d = 200, s = 50 a… view at source ↗

**Figure 3.** Figure 3: Median of err(θb) for different (n,σ 2 z ) pairs using 50 Monte Carlo iterations for s = 50, d = 200 and k = 3 with Gaussian covariates and local initial estimate. 10 20 30 40 50 60 70 0 0.5 1 1.5 2 2.5 [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗

**Figure 4.** Figure 4: Projection error difference using PCA (red), and Alg [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 5.** Figure 5: Parameter estimation error using PCA (red) and Alg [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of tail probabilities as a function of t [PITH_FULL_IMAGE:figures/full_fig_p042_6.png] view at source ↗

read the original abstract

This paper presents Sparse Gradient Descent as a solution for variable selection in convex piecewise linear regression, where the model is given as the maximum of $k$-affine functions $ x \mapsto \max_{j \in [k]} \langle a_j^\star, x \rangle + b_j^\star$ for $j = 1,\dots,k$. Here, $\{ a_j^\star\}_{j=1}^k$ and $\{b_j^\star\}_{j=1}^k$ denote the ground-truth weight vectors and intercepts. A non-asymptotic local convergence analysis is provided for Sp-GD under sub-Gaussian noise when the covariate distribution satisfies the sub-Gaussianity and anti-concentration properties. When the model order and parameters are fixed, Sp-GD provides an $\epsilon$-accurate estimate given $\mathcal{O}(\max(\epsilon^{-2}\sigma_z^2,1)s\log(d/s))$ observations where $\sigma_z^2$ denotes the noise variance. This also implies the exact parameter recovery by Sp-GD from $\mathcal{O}(s\log(d/s))$ noise-free observations. The proposed initialization scheme uses sparse principal component analysis to estimate the subspace spanned by $\{ a_j^\star\}_{j=1}^k$, then applies an $r$-covering search to estimate the model parameters. A non-asymptotic analysis is presented for this initialization scheme when the covariates and noise samples follow Gaussian distributions. When the model order and parameters are fixed, this initialization scheme provides an $\epsilon$-accurate estimate given $\mathcal{O}(\epsilon^{-2}\max(\sigma_z^4,\sigma_z^2,1)s^2\log^4(d))$ observations. A new transformation named Real Maslov Dequantization (RMD) is proposed to transform sparse generalized polynomials into sparse max-affine models. The error decay rate of RMD is shown to be exponentially small in its temperature parameter. Furthermore, theoretical guarantees for Sp-GD are extended to the bounded noise model induced by RMD. Numerical Monte Carlo results corroborate theoretical findings for Sp-GD and the initialization scheme.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sp-GD gives a local convergence result for sparse max-affine regression but the initialization bound leaves a gap with the noise-free exact recovery claim.

read the letter

Sp-GD and the RMD transformation are the main new elements here, with non-asymptotic bounds that seem carefully derived under the stated assumptions, but the initialization complexity creates a gap with the claimed noise-free rates. The work targets sparse max-affine regression, which is max of k affine functions with sparse weights. They give Sp-GD, a gradient method tailored for this, with local convergence analysis when covariates are sub-Gaussian with anti-concentration. The bound is O(max(ε^{-2}σ_z²,1) s log(d/s)) for fixed k. They also have an init scheme using sparse PCA to find the subspace then r-covering search, analyzed under Gaussian, with bound involving s² log^4(d). RMD turns sparse polys into max-affine with exp small error in temp param, and they extend the guarantees. This is useful progress if the bounds check out, and the Monte Carlo results help. The citation pattern looks standard for the area. The concern about init not supporting the O(s log(d/s)) exact recovery in noise-free case does seem to land from the abstract. When σ_z=0, init still has the s² factor, and no mention of how it improves. That makes the overall sample complexity claim shaky until addressed. The anti-concentration is a strong assumption too, but at least stated. Readers in sparse regression and non-asymptotic stats would get value from the specific rates and the RMD idea. The paper shows honest engagement with the math, so it should go to peer review to verify the derivations and close any gaps.

Referee Report

1 major / 1 minor

Summary. The paper introduces Sparse Gradient Descent (Sp-GD) for sparse max-affine regression, where the target is the pointwise maximum of k affine functions. It provides a non-asymptotic local convergence analysis of Sp-GD under sub-Gaussian noise when covariates satisfy sub-Gaussianity and anti-concentration, claiming that (with k and parameters fixed) an ε-accurate estimate is obtained from O(max(ε^{-2}σ_z²,1) s log(d/s)) samples; this is said to imply exact recovery from O(s log(d/s)) noise-free observations. A separate initialization procedure based on sparse PCA followed by r-covering search is analyzed under Gaussian covariates/noise and shown to require O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) samples. The paper also introduces Real Maslov Dequantization (RMD) to convert sparse generalized polynomials into max-affine models (with exponentially small approximation error in the temperature parameter) and extends the Sp-GD guarantees to the induced bounded-noise setting. Monte Carlo experiments are reported to corroborate the theory.

Significance. If the central sample-complexity claims can be made internally consistent, the non-asymptotic local-convergence result for Sp-GD and the accompanying initialization analysis would constitute a concrete contribution to the theory of sparse piecewise-linear regression in high dimensions. The RMD transformation is a novel technical device whose exponential error decay is a clear strength. The work supplies explicit dependence on sparsity s, dimension d, and noise level, which is useful when such rates are needed for downstream algorithm design.

major comments (1)

[Abstract] Abstract: the claim that Sp-GD yields exact recovery from O(s log(d/s)) noise-free observations is not supported by the given initialization bound. The initialization analysis (under Gaussian covariates) requires O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) samples and is not shown to improve to O(s log(d/s)) when σ_z=0. Because the Sp-GD guarantee is local, the overall procedure cannot achieve the stated noise-free rate unless the initialization succeeds at that rate.

minor comments (1)

The abstract refers to 'non-asymptotic analyses' for both Sp-GD and initialization without citing specific theorem or proposition numbers, making it difficult to locate the precise statements that support the displayed sample-complexity expressions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for identifying an important inconsistency in the abstract. We address the comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that Sp-GD yields exact recovery from O(s log(d/s)) noise-free observations is not supported by the given initialization bound. The initialization analysis (under Gaussian covariates) requires O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) samples and is not shown to improve to O(s log(d/s)) when σ_z=0. Because the Sp-GD guarantee is local, the overall procedure cannot achieve the stated noise-free rate unless the initialization succeeds at that rate.

Authors: We agree that the referee has identified a genuine inconsistency. The local convergence analysis of Sp-GD (under sub-Gaussian covariates and noise) shows that an ε-accurate estimate is obtained from O(max(ε^{-2}σ_z²,1) s log(d/s)) samples, which specializes to O(s log(d/s)) samples for exact recovery when σ_z=0. However, this is a local result that assumes the iterate begins inside a basin of attraction. The separate initialization analysis is performed under Gaussian assumptions and yields the stated O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) bound, which remains O(ε^{-2} s² log^4(d)) even when σ_z=0 and therefore does not support the noise-free claim. We will revise the abstract to remove the implication that the full procedure (initialization plus Sp-GD) achieves exact recovery from O(s log(d/s)) noise-free observations, and will instead present the local convergence rate and the initialization rate as separate results, noting that global guarantees depend on the quality of the initializer. revision: yes

Circularity Check

0 steps flagged

No circularity; independent analyses under stated assumptions

full rationale

The provided abstract and context show separate non-asymptotic local convergence analysis for Sp-GD (yielding O(max(ε^{-2}σ_z²,1)s log(d/s)) under sub-Gaussian anti-concentration) and a distinct Gaussian analysis for the sparse-PCA + r-covering initialization (O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d))). These are derived from explicit distributional assumptions rather than any self-definition, fitted-input-as-prediction, or load-bearing self-citation chain. No quoted step reduces the claimed sample complexity or exact-recovery implication to a tautology by construction. The derivation chain is self-contained against the stated external assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The analysis rests on domain assumptions about noise and covariates plus the new RMD construction; no free parameters are fitted to data in the stated rates, and no new physical entities are postulated.

axioms (2)

domain assumption Covariates are sub-Gaussian and satisfy anti-concentration properties
Invoked for the non-asymptotic local convergence of Sp-GD under sub-Gaussian noise.
domain assumption Noise is sub-Gaussian with variance σ_z²
Used to derive the sample complexity O(max(ε^{-2}σ_z²,1)s log(d/s)).

invented entities (1)

Real Maslov Dequantization (RMD) no independent evidence
purpose: Transforms sparse generalized polynomials into sparse max-affine models with exponentially small error in temperature parameter
New transformation proposed to extend theoretical guarantees to bounded noise models.

pith-pipeline@v0.9.0 · 5929 in / 1507 out tokens · 30878 ms · 2026-05-23T17:36:15.034740+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
stat.ML 2026-05 unverdicted novelty 7.0

ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is min...

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

K. A. Chandrasekher, M. Lou, and A. Pananjady. Alternating m inimization for generalized rank one matrix sensing: Sharp predictions from a random ini tialization. arXiv preprint arXiv:2207.09660,

work page arXiv
[2]

Optimal Bounds on the VC-dimension

M. Csikos, A. Kupavskii, and N. H. Mustafa. Optimal bounds on the vc-dimension. arXiv preprint arXiv:1807.07924,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

S. S. Dey, R. Mazumder, and G. Wang. A convex integer programm ing approach for optimal sparse pca. arXiv preprint arXiv:1810.09062 ,

work page arXiv
[4]

Exact and approximation algorithms for sparse pca

Y. Li and W. Xie. Exact and approximation algorithms for spar se pca. arXiv preprint arXiv:2008.12438,

work page arXiv 2008
[5]

URL https://proceedings.neurips.cc/paper_files/paper/2014/file/ 74563ba21a90da13dacf2a73e3ddefa7-Paper.pdf . L. Wasserman and K. Roeder. High-dimensional variable sele ction. The Annals of Statis- tics, pages 2178–2201,

work page 2014
[6]

X. Yi, C. Caramanis, and S. Sanghavi. Solving a mixture of man y random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

(79) Proof

Then for all ǫ∈ [0, 1], there exists an absolute constant C > 0 where it holds with probability at least 1− δ that sup U ∈Z s      ˜Π U n∑ i=1 ωi ([xi; 1][xi; 1]T− Id+1) ˜Π U     ≤ ℓǫ (78) if l≥ C(η∨ 1)4ǫ− 2 [ s log (d s ) + log (1 δ )] . (79) Proof. The proof of the lemma relies on the unitary invariance of the spectral norm. Let {ri}n i=1 be i...

work page 2019
[8]

(90) Proof

min [ s log (n∨ d s ) + log (k δ )] . (90) Proof. The proof of this lemma is a direct application of the union bo und by inﬂating the probability of error by |Zs|= (d s ) ≤ (ed s )s to the statement in [Kim and Lee, 2024, Lemma 7.7]. The next lemma provides an upper bound of a noise-related ter m that will appear in the proof of the main theorem. Lemma B....

work page 2024
[9]

Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1

and α ∈ (0, 1). Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1. Then, with probability at least 1− δ we have sup I:|I|≤αn U ∈Z s λ 1 [ ˜Π U ( 1 n ∑ i∈I [xi; 1][xi; 1]⊤ ) ˜Π ⊤ U ] ≤ C(η2∨ 1)√ α, (92) if n≥ α − 1 [s log(d/s ) + log(1/δ )] . (93) Proof. For ﬁxed U ∈Zs, (92) follows directly from [Tan and Vershynin, 2019...

work page 2019
[10]

Again, comparing this tail probability with the last inequality yields the assert ion in this lemma

for all t > 0 where the second inequality follows from the tail probabil ity of any sub- exponential random variable [Vershynin, 2018, Propositio n 2.7.1]. Again, comparing this tail probability with the last inequality yields the assert ion in this lemma. A simple graph- ical example is provided in Figure 6, assuming that Cmax j∈ [k] ∥Xj∥ψ 1 = 1, where w...

work page 2018
[11]

Noti ce that xi d∼ xiri for every i∈ [n] by the symmetry of the Gaussian distribution

(105) Let{ri}n i=1 be independent copies of a Rademacher random variable. Noti ce that xi d∼ xiri for every i∈ [n] by the symmetry of the Gaussian distribution. Therefore un der the event 44 n⋂ i=1 Fi, it holds that P (     1 n n∑ i=1 zi 5σz √ log n xi      ∞ ≥ t ) = P (     1 n n∑ i=1 zi 5σz √ log n xiri      ∞ ≥ t ) ≤ 2P (     1...

work page 2013
[12]

Deﬁnition D.1

We proceed in a similar fashion and decompose the diﬀerence of the second moments as ˆM2− M2 = 1 n n∑ i=1 max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i− Id)− E [ max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i− Id) ] + 1 n n∑ i=1 zi(xixT i− Id) = 1 n n∑ i=1 ( max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i )− E [ max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i ) ])    Q1 − Id· 1 n n∑ i=1 ( max j∈ [k]...

work page 2023
[13]

(114) 46 We observe that one valid pair for Young’s inequality is ( p, q ) = (α +θ α , α +θ θ ) . Therefore, we have that E [ exp { (XY )(α +θ)− 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 }] ≤ E [ exp { α α + θ· Xα − 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 } ·exp { θ α + θ· Yθ− 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 }] ≤ E [ α α + θ·exp { Xα − 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 } + θ α + θ·exp { Yθ− 1 ∥X∥ψ α − 1 ∥Y...

work page 2022
[14]

Then we have for all i∈ [n] that ∥zi ([xi]l[xi]m− δlm)∥ψ 2/ 3 ≤∥ zi∥ψ 2∥xi∥2 ψ 2≤ Cσz

(121) Next, for some l, m ∈ [d], let the Kronecker delta δlm = 1 only when l = m and 0 otherwise. Then we have for all i∈ [n] that ∥zi ([xi]l[xi]m− δlm)∥ψ 2/ 3 ≤∥ zi∥ψ 2∥xi∥2 ψ 2≤ Cσz. (122) where the ﬁrst inequality follows by Corollary D.3. Therefo re, using [Zhang and Wei, 2022, Proposition 3] with the union bound over d2 entries, we have that P (  ...

work page 2022
[15]

Once again R > 0 can be made suﬃciently small so that the right-hand side of (125) deﬁned as ̺ is arbitrarily small

≜ ̺, (125) where (i) follows from [Kim and Lee, 2024, Lemma 7.5]; (ii) ho lds since a log1/ 2(2/a ) is monotone increasing for a∈ (0, 1]; (iii) follows from the fact that a≤ b 2 log− 1/ 2(1/b ) implies a log1/ 2(2/a )≤ b for b∈ (0, 0.1]; (iv) holds since π⋆ j≥ πmin for all j∈ [k] by the deﬁntion of πmin in (16); and (v) holds since πmin≤ 1 k . Once again ...

work page 2024
[16]

Finally, we can write the bound π⋆ j≥ P(x∈C j∩C ⋆ j )≥ (1− 2̺ 1− ̺ ) πj

min πjk ≤ π⋆ j πj · R2ζπ 1+2ζ− 1 min k ≤ π⋆ j πj · R2ζ k2(1+ζ− 1)≤ ̺ 1− ̺ , (127) 49 where the last inequality follows from the deﬁnition of ̺ in (125) and the bound derived in (126). Finally, we can write the bound π⋆ j≥ P(x∈C j∩C ⋆ j )≥ (1− 2̺ 1− ̺ ) πj. (128) Combining (126) and (128) provides the assertion in (85) thu s concluding the proof. D.4 Proof...

work page 2018
[17]

, ωn) > ǫ     ≤ δ, (136) if n≥ C(η∨ 1)4ǫ− 2 [ sk log (n s ) + s log (d s ) + log (1 δ )] which concludes the proof

Then, using the union bound, and inﬂating the probability of error δ by the worst-case cardinality in (133), we obtain that P     sup x′ 1,..., x′ n {ω i}n i=1∈H (Pk,d,s , {x′ i}n i=1) f (ω 1, . . . , ωn) > ǫ     ≤ δ, (136) if n≥ C(η∨ 1)4ǫ− 2 [ sk log (n s ) + s log (d s ) + log (1 δ )] which concludes the proof. D.5 Proof of Lemma B.3 By the deﬁn...

work page 2023
[18]

For ﬁxed U∈Z s, we have that Sv, v⋆∈P 2,d (U ) and similar to (131) we have sup x1,..., xn |H(P2,d (U ),{xi}n i=1)|≤ ( en s + 1 )2(s+1)

min k2 , (139) when the sample complexity satisﬁes (84). For ﬁxed U∈Z s, we have that Sv, v⋆∈P 2,d (U ) and similar to (131) we have sup x1,..., xn |H(P2,d (U ),{xi}n i=1)|≤ ( en s + 1 )2(s+1) . (140) 51 Therefore, for ﬁxed U∈Z s we have by [Kim and Lee, 2023, Lemma 6.3] with probability at least 1− δ that sup (v, v⋆ )∈M ⏐ ⏐ ⏐ ⏐ ⏐ 1 n n∑ i=1 1{ξi∈S v, v⋆ ...

work page 2023
[19]

Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1

and α ∈ (0, 1). Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1. Then, with probability at least 1− δ we have sup I:|I|≤αn U ∈Z s      1 n ∑ i∈I Π U xi      2 ≤ C(η2∨ 1)√ α, (160) if n≥ α − 1 [s log(d/s ) + log(1/δ )] . (161) Proof. AssumeU is ﬁxed and let x′ = [ x]U ∈ Rs. Deﬁne the collection of all possibl...

work page 2019

[1] [1]

K. A. Chandrasekher, M. Lou, and A. Pananjady. Alternating m inimization for generalized rank one matrix sensing: Sharp predictions from a random ini tialization. arXiv preprint arXiv:2207.09660,

work page arXiv

[2] [2]

Optimal Bounds on the VC-dimension

M. Csikos, A. Kupavskii, and N. H. Mustafa. Optimal bounds on the vc-dimension. arXiv preprint arXiv:1807.07924,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

S. S. Dey, R. Mazumder, and G. Wang. A convex integer programm ing approach for optimal sparse pca. arXiv preprint arXiv:1810.09062 ,

work page arXiv

[4] [4]

Exact and approximation algorithms for sparse pca

Y. Li and W. Xie. Exact and approximation algorithms for spar se pca. arXiv preprint arXiv:2008.12438,

work page arXiv 2008

[5] [5]

URL https://proceedings.neurips.cc/paper_files/paper/2014/file/ 74563ba21a90da13dacf2a73e3ddefa7-Paper.pdf . L. Wasserman and K. Roeder. High-dimensional variable sele ction. The Annals of Statis- tics, pages 2178–2201,

work page 2014

[6] [6]

X. Yi, C. Caramanis, and S. Sanghavi. Solving a mixture of man y random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

(79) Proof

Then for all ǫ∈ [0, 1], there exists an absolute constant C > 0 where it holds with probability at least 1− δ that sup U ∈Z s      ˜Π U n∑ i=1 ωi ([xi; 1][xi; 1]T− Id+1) ˜Π U     ≤ ℓǫ (78) if l≥ C(η∨ 1)4ǫ− 2 [ s log (d s ) + log (1 δ )] . (79) Proof. The proof of the lemma relies on the unitary invariance of the spectral norm. Let {ri}n i=1 be i...

work page 2019

[8] [8]

(90) Proof

min [ s log (n∨ d s ) + log (k δ )] . (90) Proof. The proof of this lemma is a direct application of the union bo und by inﬂating the probability of error by |Zs|= (d s ) ≤ (ed s )s to the statement in [Kim and Lee, 2024, Lemma 7.7]. The next lemma provides an upper bound of a noise-related ter m that will appear in the proof of the main theorem. Lemma B....

work page 2024

[9] [9]

Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1

and α ∈ (0, 1). Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1. Then, with probability at least 1− δ we have sup I:|I|≤αn U ∈Z s λ 1 [ ˜Π U ( 1 n ∑ i∈I [xi; 1][xi; 1]⊤ ) ˜Π ⊤ U ] ≤ C(η2∨ 1)√ α, (92) if n≥ α − 1 [s log(d/s ) + log(1/δ )] . (93) Proof. For ﬁxed U ∈Zs, (92) follows directly from [Tan and Vershynin, 2019...

work page 2019

[10] [10]

Again, comparing this tail probability with the last inequality yields the assert ion in this lemma

for all t > 0 where the second inequality follows from the tail probabil ity of any sub- exponential random variable [Vershynin, 2018, Propositio n 2.7.1]. Again, comparing this tail probability with the last inequality yields the assert ion in this lemma. A simple graph- ical example is provided in Figure 6, assuming that Cmax j∈ [k] ∥Xj∥ψ 1 = 1, where w...

work page 2018

[11] [11]

Noti ce that xi d∼ xiri for every i∈ [n] by the symmetry of the Gaussian distribution

(105) Let{ri}n i=1 be independent copies of a Rademacher random variable. Noti ce that xi d∼ xiri for every i∈ [n] by the symmetry of the Gaussian distribution. Therefore un der the event 44 n⋂ i=1 Fi, it holds that P (     1 n n∑ i=1 zi 5σz √ log n xi      ∞ ≥ t ) = P (     1 n n∑ i=1 zi 5σz √ log n xiri      ∞ ≥ t ) ≤ 2P (     1...

work page 2013

[12] [12]

Deﬁnition D.1

We proceed in a similar fashion and decompose the diﬀerence of the second moments as ˆM2− M2 = 1 n n∑ i=1 max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i− Id)− E [ max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i− Id) ] + 1 n n∑ i=1 zi(xixT i− Id) = 1 n n∑ i=1 ( max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i )− E [ max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i ) ])    Q1 − Id· 1 n n∑ i=1 ( max j∈ [k]...

work page 2023

[13] [13]

(114) 46 We observe that one valid pair for Young’s inequality is ( p, q ) = (α +θ α , α +θ θ ) . Therefore, we have that E [ exp { (XY )(α +θ)− 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 }] ≤ E [ exp { α α + θ· Xα − 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 } ·exp { θ α + θ· Yθ− 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 }] ≤ E [ α α + θ·exp { Xα − 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 } + θ α + θ·exp { Yθ− 1 ∥X∥ψ α − 1 ∥Y...

work page 2022

[14] [14]

Then we have for all i∈ [n] that ∥zi ([xi]l[xi]m− δlm)∥ψ 2/ 3 ≤∥ zi∥ψ 2∥xi∥2 ψ 2≤ Cσz

(121) Next, for some l, m ∈ [d], let the Kronecker delta δlm = 1 only when l = m and 0 otherwise. Then we have for all i∈ [n] that ∥zi ([xi]l[xi]m− δlm)∥ψ 2/ 3 ≤∥ zi∥ψ 2∥xi∥2 ψ 2≤ Cσz. (122) where the ﬁrst inequality follows by Corollary D.3. Therefo re, using [Zhang and Wei, 2022, Proposition 3] with the union bound over d2 entries, we have that P (  ...

work page 2022

[15] [15]

Once again R > 0 can be made suﬃciently small so that the right-hand side of (125) deﬁned as ̺ is arbitrarily small

≜ ̺, (125) where (i) follows from [Kim and Lee, 2024, Lemma 7.5]; (ii) ho lds since a log1/ 2(2/a ) is monotone increasing for a∈ (0, 1]; (iii) follows from the fact that a≤ b 2 log− 1/ 2(1/b ) implies a log1/ 2(2/a )≤ b for b∈ (0, 0.1]; (iv) holds since π⋆ j≥ πmin for all j∈ [k] by the deﬁntion of πmin in (16); and (v) holds since πmin≤ 1 k . Once again ...

work page 2024

[16] [16]

Finally, we can write the bound π⋆ j≥ P(x∈C j∩C ⋆ j )≥ (1− 2̺ 1− ̺ ) πj

min πjk ≤ π⋆ j πj · R2ζπ 1+2ζ− 1 min k ≤ π⋆ j πj · R2ζ k2(1+ζ− 1)≤ ̺ 1− ̺ , (127) 49 where the last inequality follows from the deﬁnition of ̺ in (125) and the bound derived in (126). Finally, we can write the bound π⋆ j≥ P(x∈C j∩C ⋆ j )≥ (1− 2̺ 1− ̺ ) πj. (128) Combining (126) and (128) provides the assertion in (85) thu s concluding the proof. D.4 Proof...

work page 2018

[17] [17]

, ωn) > ǫ     ≤ δ, (136) if n≥ C(η∨ 1)4ǫ− 2 [ sk log (n s ) + s log (d s ) + log (1 δ )] which concludes the proof

Then, using the union bound, and inﬂating the probability of error δ by the worst-case cardinality in (133), we obtain that P     sup x′ 1,..., x′ n {ω i}n i=1∈H (Pk,d,s , {x′ i}n i=1) f (ω 1, . . . , ωn) > ǫ     ≤ δ, (136) if n≥ C(η∨ 1)4ǫ− 2 [ sk log (n s ) + s log (d s ) + log (1 δ )] which concludes the proof. D.5 Proof of Lemma B.3 By the deﬁn...

work page 2023

[18] [18]

For ﬁxed U∈Z s, we have that Sv, v⋆∈P 2,d (U ) and similar to (131) we have sup x1,..., xn |H(P2,d (U ),{xi}n i=1)|≤ ( en s + 1 )2(s+1)

min k2 , (139) when the sample complexity satisﬁes (84). For ﬁxed U∈Z s, we have that Sv, v⋆∈P 2,d (U ) and similar to (131) we have sup x1,..., xn |H(P2,d (U ),{xi}n i=1)|≤ ( en s + 1 )2(s+1) . (140) 51 Therefore, for ﬁxed U∈Z s we have by [Kim and Lee, 2023, Lemma 6.3] with probability at least 1− δ that sup (v, v⋆ )∈M ⏐ ⏐ ⏐ ⏐ ⏐ 1 n n∑ i=1 1{ξi∈S v, v⋆ ...

work page 2023

[19] [19]

Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1

and α ∈ (0, 1). Let {xi}n i=1 be independent copies of a random vector x that satisﬁes Assumption 2.1. Then, with probability at least 1− δ we have sup I:|I|≤αn U ∈Z s      1 n ∑ i∈I Π U xi      2 ≤ C(η2∨ 1)√ α, (160) if n≥ α − 1 [s log(d/s ) + log(1/δ )] . (161) Proof. AssumeU is ﬁxed and let x′ = [ x]U ∈ Rs. Deﬁne the collection of all possibl...

work page 2019