Sparse Max-Affine Regression
Pith reviewed 2026-05-23 17:36 UTC · model grok-4.3
The pith
Sparse gradient descent recovers the parameters of a sparse max-affine model exactly from O(s log(d/s)) noise-free observations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the model order and parameters are fixed, Sp-GD provides an ε-accurate estimate given O(max(ε^{-2}σ_z²,1)s log(d/s)) observations under sub-Gaussian noise with anti-concentration, implying exact recovery from O(s log(d/s)) noise-free observations.
What carries the argument
Sparse Gradient Descent (Sp-GD) that iteratively refines sparse weight vectors inside the max-affine representation while performing variable selection.
If this is right
- Sp-GD converges locally to the ground-truth parameters at the stated sample rate.
- The sparse-PCA initialization returns an ε-accurate starting point after O(ε^{-2} max(σ_z^4, σ_z², 1) s² log⁴(d)) Gaussian samples.
- Real Maslov Dequantization maps sparse generalized polynomials to max-affine models with error that decays exponentially in the temperature parameter.
- All convergence guarantees extend directly to the bounded-noise regime created by the dequantization map.
Where Pith is reading between the lines
- The exponential decay rate of the dequantization error suggests the same transformation could approximate other classes of sparse non-smooth functions.
- If anti-concentration can be verified for a wider family of covariate distributions, the same Sp-GD analysis would immediately cover additional piecewise-linear regression settings.
- Allowing the number of affine pieces k to grow slowly with dimension would be a direct next step that preserves the logarithmic dependence on d.
Load-bearing premise
The covariate distribution must obey sub-Gaussianity together with anti-concentration for the local convergence proof to apply.
What would settle it
Run Sp-GD on noise-free data drawn from a distribution satisfying the stated assumptions and check whether recovery error stays bounded away from zero once the sample count drops below c s log(d/s) for small constant c.
Figures
read the original abstract
This paper presents Sparse Gradient Descent as a solution for variable selection in convex piecewise linear regression, where the model is given as the maximum of $k$-affine functions $ x \mapsto \max_{j \in [k]} \langle a_j^\star, x \rangle + b_j^\star$ for $j = 1,\dots,k$. Here, $\{ a_j^\star\}_{j=1}^k$ and $\{b_j^\star\}_{j=1}^k$ denote the ground-truth weight vectors and intercepts. A non-asymptotic local convergence analysis is provided for Sp-GD under sub-Gaussian noise when the covariate distribution satisfies the sub-Gaussianity and anti-concentration properties. When the model order and parameters are fixed, Sp-GD provides an $\epsilon$-accurate estimate given $\mathcal{O}(\max(\epsilon^{-2}\sigma_z^2,1)s\log(d/s))$ observations where $\sigma_z^2$ denotes the noise variance. This also implies the exact parameter recovery by Sp-GD from $\mathcal{O}(s\log(d/s))$ noise-free observations. The proposed initialization scheme uses sparse principal component analysis to estimate the subspace spanned by $\{ a_j^\star\}_{j=1}^k$, then applies an $r$-covering search to estimate the model parameters. A non-asymptotic analysis is presented for this initialization scheme when the covariates and noise samples follow Gaussian distributions. When the model order and parameters are fixed, this initialization scheme provides an $\epsilon$-accurate estimate given $\mathcal{O}(\epsilon^{-2}\max(\sigma_z^4,\sigma_z^2,1)s^2\log^4(d))$ observations. A new transformation named Real Maslov Dequantization (RMD) is proposed to transform sparse generalized polynomials into sparse max-affine models. The error decay rate of RMD is shown to be exponentially small in its temperature parameter. Furthermore, theoretical guarantees for Sp-GD are extended to the bounded noise model induced by RMD. Numerical Monte Carlo results corroborate theoretical findings for Sp-GD and the initialization scheme.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Sparse Gradient Descent (Sp-GD) for sparse max-affine regression, where the target is the pointwise maximum of k affine functions. It provides a non-asymptotic local convergence analysis of Sp-GD under sub-Gaussian noise when covariates satisfy sub-Gaussianity and anti-concentration, claiming that (with k and parameters fixed) an ε-accurate estimate is obtained from O(max(ε^{-2}σ_z²,1) s log(d/s)) samples; this is said to imply exact recovery from O(s log(d/s)) noise-free observations. A separate initialization procedure based on sparse PCA followed by r-covering search is analyzed under Gaussian covariates/noise and shown to require O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) samples. The paper also introduces Real Maslov Dequantization (RMD) to convert sparse generalized polynomials into max-affine models (with exponentially small approximation error in the temperature parameter) and extends the Sp-GD guarantees to the induced bounded-noise setting. Monte Carlo experiments are reported to corroborate the theory.
Significance. If the central sample-complexity claims can be made internally consistent, the non-asymptotic local-convergence result for Sp-GD and the accompanying initialization analysis would constitute a concrete contribution to the theory of sparse piecewise-linear regression in high dimensions. The RMD transformation is a novel technical device whose exponential error decay is a clear strength. The work supplies explicit dependence on sparsity s, dimension d, and noise level, which is useful when such rates are needed for downstream algorithm design.
major comments (1)
- [Abstract] Abstract: the claim that Sp-GD yields exact recovery from O(s log(d/s)) noise-free observations is not supported by the given initialization bound. The initialization analysis (under Gaussian covariates) requires O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) samples and is not shown to improve to O(s log(d/s)) when σ_z=0. Because the Sp-GD guarantee is local, the overall procedure cannot achieve the stated noise-free rate unless the initialization succeeds at that rate.
minor comments (1)
- The abstract refers to 'non-asymptotic analyses' for both Sp-GD and initialization without citing specific theorem or proposition numbers, making it difficult to locate the precise statements that support the displayed sample-complexity expressions.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for identifying an important inconsistency in the abstract. We address the comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that Sp-GD yields exact recovery from O(s log(d/s)) noise-free observations is not supported by the given initialization bound. The initialization analysis (under Gaussian covariates) requires O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) samples and is not shown to improve to O(s log(d/s)) when σ_z=0. Because the Sp-GD guarantee is local, the overall procedure cannot achieve the stated noise-free rate unless the initialization succeeds at that rate.
Authors: We agree that the referee has identified a genuine inconsistency. The local convergence analysis of Sp-GD (under sub-Gaussian covariates and noise) shows that an ε-accurate estimate is obtained from O(max(ε^{-2}σ_z²,1) s log(d/s)) samples, which specializes to O(s log(d/s)) samples for exact recovery when σ_z=0. However, this is a local result that assumes the iterate begins inside a basin of attraction. The separate initialization analysis is performed under Gaussian assumptions and yields the stated O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d)) bound, which remains O(ε^{-2} s² log^4(d)) even when σ_z=0 and therefore does not support the noise-free claim. We will revise the abstract to remove the implication that the full procedure (initialization plus Sp-GD) achieves exact recovery from O(s log(d/s)) noise-free observations, and will instead present the local convergence rate and the initialization rate as separate results, noting that global guarantees depend on the quality of the initializer. revision: yes
Circularity Check
No circularity; independent analyses under stated assumptions
full rationale
The provided abstract and context show separate non-asymptotic local convergence analysis for Sp-GD (yielding O(max(ε^{-2}σ_z²,1)s log(d/s)) under sub-Gaussian anti-concentration) and a distinct Gaussian analysis for the sparse-PCA + r-covering initialization (O(ε^{-2} max(σ_z^4,σ_z²,1) s² log^4(d))). These are derived from explicit distributional assumptions rather than any self-definition, fitted-input-as-prediction, or load-bearing self-citation chain. No quoted step reduces the claimed sample complexity or exact-recovery implication to a tautology by construction. The derivation chain is self-contained against the stated external assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Covariates are sub-Gaussian and satisfy anti-concentration properties
- domain assumption Noise is sub-Gaussian with variance σ_z²
invented entities (1)
-
Real Maslov Dequantization (RMD)
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is min...
Reference graph
Works this paper leans on
- [1]
-
[2]
Optimal Bounds on the VC-dimension
M. Csikos, A. Kupavskii, and N. H. Mustafa. Optimal bounds on the vc-dimension. arXiv preprint arXiv:1807.07924,
work page internal anchor Pith review Pith/arXiv arXiv
- [3]
-
[4]
Exact and approximation algorithms for sparse pca
Y. Li and W. Xie. Exact and approximation algorithms for spar se pca. arXiv preprint arXiv:2008.12438,
-
[5]
URL https://proceedings.neurips.cc/paper_files/paper/2014/file/ 74563ba21a90da13dacf2a73e3ddefa7-Paper.pdf . L. Wasserman and K. Roeder. High-dimensional variable sele ction. The Annals of Statis- tics, pages 2178–2201,
work page 2014
-
[6]
X. Yi, C. Caramanis, and S. Sanghavi. Solving a mixture of man y random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Then for all ǫ∈ [0, 1], there exists an absolute constant C > 0 where it holds with probability at least 1− δ that sup U ∈Z s ˜Π U n∑ i=1 ωi ([xi; 1][xi; 1]T− Id+1) ˜Π U ≤ ℓǫ (78) if l≥ C(η∨ 1)4ǫ− 2 [ s log (d s ) + log (1 δ )] . (79) Proof. The proof of the lemma relies on the unitary invariance of the spectral norm. Let {ri}n i=1 be i...
work page 2019
-
[8]
min [ s log (n∨ d s ) + log (k δ )] . (90) Proof. The proof of this lemma is a direct application of the union bo und by inflating the probability of error by |Zs|= (d s ) ≤ (ed s )s to the statement in [Kim and Lee, 2024, Lemma 7.7]. The next lemma provides an upper bound of a noise-related ter m that will appear in the proof of the main theorem. Lemma B....
work page 2024
-
[9]
Let {xi}n i=1 be independent copies of a random vector x that satisfies Assumption 2.1
and α ∈ (0, 1). Let {xi}n i=1 be independent copies of a random vector x that satisfies Assumption 2.1. Then, with probability at least 1− δ we have sup I:|I|≤αn U ∈Z s λ 1 [ ˜Π U ( 1 n ∑ i∈I [xi; 1][xi; 1]⊤ ) ˜Π ⊤ U ] ≤ C(η2∨ 1)√ α, (92) if n≥ α − 1 [s log(d/s ) + log(1/δ )] . (93) Proof. For fixed U ∈Zs, (92) follows directly from [Tan and Vershynin, 2019...
work page 2019
-
[10]
Again, comparing this tail probability with the last inequality yields the assert ion in this lemma
for all t > 0 where the second inequality follows from the tail probabil ity of any sub- exponential random variable [Vershynin, 2018, Propositio n 2.7.1]. Again, comparing this tail probability with the last inequality yields the assert ion in this lemma. A simple graph- ical example is provided in Figure 6, assuming that Cmax j∈ [k] ∥Xj∥ψ 1 = 1, where w...
work page 2018
-
[11]
Noti ce that xi d∼ xiri for every i∈ [n] by the symmetry of the Gaussian distribution
(105) Let{ri}n i=1 be independent copies of a Rademacher random variable. Noti ce that xi d∼ xiri for every i∈ [n] by the symmetry of the Gaussian distribution. Therefore un der the event 44 n⋂ i=1 Fi, it holds that P ( 1 n n∑ i=1 zi 5σz √ log n xi ∞ ≥ t ) = P ( 1 n n∑ i=1 zi 5σz √ log n xiri ∞ ≥ t ) ≤ 2P ( 1...
work page 2013
-
[12]
We proceed in a similar fashion and decompose the difference of the second moments as ˆM2− M2 = 1 n n∑ i=1 max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i− Id)− E [ max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i− Id) ] + 1 n n∑ i=1 zi(xixT i− Id) = 1 n n∑ i=1 ( max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i )− E [ max j∈ [k] ⟨θ⋆ j , [xi; 1]⟩(xixT i ) ]) Q1 − Id· 1 n n∑ i=1 ( max j∈ [k]...
work page 2023
-
[13]
(114) 46 We observe that one valid pair for Young’s inequality is ( p, q ) = (α +θ α , α +θ θ ) . Therefore, we have that E [ exp { (XY )(α +θ)− 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 }] ≤ E [ exp { α α + θ· Xα − 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 } ·exp { θ α + θ· Yθ− 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 }] ≤ E [ α α + θ·exp { Xα − 1 ∥X∥ψ α − 1 ∥Y∥ψ θ − 1 } + θ α + θ·exp { Yθ− 1 ∥X∥ψ α − 1 ∥Y...
work page 2022
-
[14]
Then we have for all i∈ [n] that ∥zi ([xi]l[xi]m− δlm)∥ψ 2/ 3 ≤∥ zi∥ψ 2∥xi∥2 ψ 2≤ Cσz
(121) Next, for some l, m ∈ [d], let the Kronecker delta δlm = 1 only when l = m and 0 otherwise. Then we have for all i∈ [n] that ∥zi ([xi]l[xi]m− δlm)∥ψ 2/ 3 ≤∥ zi∥ψ 2∥xi∥2 ψ 2≤ Cσz. (122) where the first inequality follows by Corollary D.3. Therefo re, using [Zhang and Wei, 2022, Proposition 3] with the union bound over d2 entries, we have that P ( ...
work page 2022
-
[15]
≜ ̺, (125) where (i) follows from [Kim and Lee, 2024, Lemma 7.5]; (ii) ho lds since a log1/ 2(2/a ) is monotone increasing for a∈ (0, 1]; (iii) follows from the fact that a≤ b 2 log− 1/ 2(1/b ) implies a log1/ 2(2/a )≤ b for b∈ (0, 0.1]; (iv) holds since π⋆ j≥ πmin for all j∈ [k] by the defintion of πmin in (16); and (v) holds since πmin≤ 1 k . Once again ...
work page 2024
-
[16]
Finally, we can write the bound π⋆ j≥ P(x∈C j∩C ⋆ j )≥ (1− 2̺ 1− ̺ ) πj
min πjk ≤ π⋆ j πj · R2ζπ 1+2ζ− 1 min k ≤ π⋆ j πj · R2ζ k2(1+ζ− 1)≤ ̺ 1− ̺ , (127) 49 where the last inequality follows from the definition of ̺ in (125) and the bound derived in (126). Finally, we can write the bound π⋆ j≥ P(x∈C j∩C ⋆ j )≥ (1− 2̺ 1− ̺ ) πj. (128) Combining (126) and (128) provides the assertion in (85) thu s concluding the proof. D.4 Proof...
work page 2018
-
[17]
Then, using the union bound, and inflating the probability of error δ by the worst-case cardinality in (133), we obtain that P sup x′ 1,..., x′ n {ω i}n i=1∈H (Pk,d,s , {x′ i}n i=1) f (ω 1, . . . , ωn) > ǫ ≤ δ, (136) if n≥ C(η∨ 1)4ǫ− 2 [ sk log (n s ) + s log (d s ) + log (1 δ )] which concludes the proof. D.5 Proof of Lemma B.3 By the defin...
work page 2023
-
[18]
min k2 , (139) when the sample complexity satisfies (84). For fixed U∈Z s, we have that Sv, v⋆∈P 2,d (U ) and similar to (131) we have sup x1,..., xn |H(P2,d (U ),{xi}n i=1)|≤ ( en s + 1 )2(s+1) . (140) 51 Therefore, for fixed U∈Z s we have by [Kim and Lee, 2023, Lemma 6.3] with probability at least 1− δ that sup (v, v⋆ )∈M ⏐ ⏐ ⏐ ⏐ ⏐ 1 n n∑ i=1 1{ξi∈S v, v⋆ ...
work page 2023
-
[19]
Let {xi}n i=1 be independent copies of a random vector x that satisfies Assumption 2.1
and α ∈ (0, 1). Let {xi}n i=1 be independent copies of a random vector x that satisfies Assumption 2.1. Then, with probability at least 1− δ we have sup I:|I|≤αn U ∈Z s 1 n ∑ i∈I Π U xi 2 ≤ C(η2∨ 1)√ α, (160) if n≥ α − 1 [s log(d/s ) + log(1/δ )] . (161) Proof. AssumeU is fixed and let x′ = [ x]U ∈ Rs. Define the collection of all possibl...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.