Robust and Resource Efficient Identification of Two Hidden Layer Neural Networks

Massimo Fornasier; Michael Rauchensteiner; Timo Klock

arXiv: 1907.00485 · v1 · pith:JCODI7VMnew · submitted 2019-06-30 · 💻 cs.LG · cs.IT · math.IT · stat.ML


Pith reviewed 2026-05-25 12:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT · stat.ML

keywords neural network identification · two hidden layers · Hessian finite differences · subspace recovery · robust nonlinear program · stable recovery · activation shifts · sample complexity


The pith

Two-hidden-layer neural networks can be identified from few samples by approximating a weight subspace from Hessian finite differences and solving a nonlinear program.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to recover the weights of a two-hidden-layer neural network of the form f(x) = 1^T h(B^T g(A^T x)) from a small number of query samples. It does this by actively sampling finite difference approximations to the network's Hessians, which together approximate the subspace spanned by symmetric tensors built from the first-layer weights and certain combinations involving the second layer. A robust nonlinear program then isolates the individual rank-one tensors inside that subspace. This approach comes with guarantees of stable recovery when certain conditions can be checked after sampling, and it also recovers activation shifts. A sympathetic reader would care because the method is fully constructive and reduces the opaque nature of network training by giving explicit sample complexity bounds.
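The sampling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's algorithm: the toy network (tanh activations, random unit-norm weights, small shifts), the function names `make_network`, `fd_hessian`, and `estimate_subspace`, the step size `eps`, and the sampling radius are all assumptions made here. How well the returned span approximates W depends on those choices and is not checked by the sketch.

```python
import numpy as np

def make_network(d=6, m0=3, m1=2, seed=0):
    """Toy f(x) = 1^T h(B^T g(A^T x)) with g = h = tanh plus shifts (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, m0))
    A /= np.linalg.norm(A, axis=0)           # unit-norm first-layer weights a_i
    B = rng.standard_normal((m0, m1))
    B /= np.linalg.norm(B, axis=0)           # unit-norm second-layer weights b_l
    theta = 0.3 * rng.standard_normal(m0)    # first-layer activation shifts
    tau = 0.3 * rng.standard_normal(m1)      # second-layer activation shifts

    def f(x):
        return np.sum(np.tanh(B.T @ np.tanh(A.T @ x + theta) + tau))

    return f, A, B, theta

def fd_hessian(f, x, eps=1e-3):
    """Central finite-difference approximation of the Hessian of f at x."""
    d = x.size
    I = np.eye(d)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(i, d):
            H[i, j] = H[j, i] = (
                f(x + eps * (I[i] + I[j])) - f(x + eps * (I[i] - I[j]))
                - f(x - eps * (I[i] - I[j])) + f(x - eps * (I[i] + I[j]))
            ) / (4.0 * eps**2)
    return H

def estimate_subspace(f, d, m, n_samples=40, radius=0.1, seed=1):
    """Span of the top-m left singular vectors of the stacked, vectorized Hessians."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, d))
    X *= radius / np.linalg.norm(X, axis=1, keepdims=True)   # points on a small sphere
    H = np.stack([fd_hessian(f, x).ravel() for x in X], axis=1)  # shape (d*d, n_samples)
    U, s, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :m], s

f, A, B, theta = make_network()
W_hat, singular_values = estimate_subspace(f, d=6, m=5)
print(W_hat.shape, singular_values[:6])
```

The decay of `singular_values` after the m-th entry gives a rough, after-the-fact indication of whether the sampled Hessians are close to an m-dimensional subspace.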

Core claim

By gathering approximate Hessians via finite differences, the method approximates the matrix subspace W spanned by the symmetric tensors a_1 ⊗ a_1, ..., a_{m_0} ⊗ a_{m_0} formed from the first-layer weights together with the entangled tensors v_ℓ ⊗ v_ℓ formed from combinations of first- and second-layer weights. It then identifies the rank-one tensors in W by solving a robust nonlinear program, with stable recovery guarantees under a posteriori verifiable conditions. Finally, it attributes the recovered weights to their layers and estimates the activation shifts via an adapted gradient descent.

What carries the argument

The matrix subspace W spanned by the symmetric tensors a_i ⊗ a_i and the entangled tensors v_ℓ ⊗ v_ℓ, recovered from finite-difference Hessian approximations. It is within this subspace that a nonlinear program isolates the rank-one tensors.
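A short chain-rule computation shows why Hessians carry this subspace. The following is a sketch under the assumption that g and h act componentwise with twice-differentiable components g_i and h_ℓ, with any shifts absorbed into them. The symbols z_ℓ and G_x are notation introduced here, and b_{iℓ} denotes the i-th entry of b_ℓ.

```latex
\begin{aligned}
z_\ell(x) &= b_\ell^T g(A^T x), \qquad
G_x = \operatorname{diag}\big(g_1'(a_1^T x),\dots,g_{m_0}'(a_{m_0}^T x)\big),\\
\nabla f(x) &= \sum_{\ell=1}^{m_1} h_\ell'\big(z_\ell(x)\big)\, A G_x b_\ell,\\
\nabla^2 f(x) &= \sum_{i=1}^{m_0} \Big(\sum_{\ell=1}^{m_1} h_\ell'\big(z_\ell(x)\big)\, b_{i\ell}\, g_i''(a_i^T x)\Big)\, a_i \otimes a_i
 \;+\; \sum_{\ell=1}^{m_1} h_\ell''\big(z_\ell(x)\big)\, (A G_x b_\ell) \otimes (A G_x b_\ell).
\end{aligned}
```

Reading G_0 as G_x at x = 0 (an interpretation of the abstract's description of G_0), the Hessian at the origin lies exactly in the span of the a_i ⊗ a_i and the (A G_0 b_ℓ) ⊗ (A G_0 b_ℓ), which is W. For x near the origin, G_x is close to G_0, so the Hessian lies only approximately in W.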

If this is right

Stable recovery of the network weights up to intrinsic symmetries under a posteriori verifiable conditions.
Correct attribution of approximate weights to the first or second layer.
Estimation of the shifts of the first-layer activation functions via adapted gradient descent, allowing exact computation of the matrix G_0.
Fully constructive identification with quantifiable sample complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The subspace recovery step via Hessians could be iterated to identify deeper networks by peeling off layers successively.
The recovered weights might serve as improved initializations for standard gradient-based training of similar two-layer architectures.
The a posteriori verifiable conditions could certify successful identification in practice without access to the original training data.

Load-bearing premise

The finite-difference approximations to the Hessians are accurate enough to recover the subspace spanned by the symmetric tensors formed by the weights.
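How accurate a finite-difference second derivative is depends strongly on the step size. The sketch below illustrates the usual trade-off between truncation error (large steps) and floating-point cancellation (tiny steps) on a one-dimensional tanh, chosen here only because its second derivative is known in closed form. The evaluation point and step sizes are arbitrary choices, and nothing here reproduces the paper's error analysis.

```python
import numpy as np

def second_difference(f, t, eps):
    """Central second-order difference quotient approximating f''(t)."""
    return (f(t + eps) - 2.0 * f(t) + f(t - eps)) / eps**2

def tanh_second_derivative(t):
    """Exact second derivative of tanh, used as ground truth."""
    return -2.0 * np.tanh(t) * (1.0 - np.tanh(t) ** 2)

t = 0.3
errors = {
    eps: abs(second_difference(np.tanh, t, eps) - tanh_second_derivative(t))
    for eps in (1e-1, 1e-3, 1e-8)
}
for eps, err in errors.items():
    print(f"eps={eps:.0e}  abs error={err:.2e}")
```

The intermediate step size gives the smallest error of the three: the large step suffers from truncation, and the smallest step loses nearly all significant digits to cancellation in double precision.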

What would settle it

Observing that the robust nonlinear program returns incorrect rank-one tensors despite the subspace W being accurately recovered from the Hessians would falsify the recovery guarantee.
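One way to probe this in a toy setting is to search a known subspace for near rank-one elements. The sketch below is a generic projected ascent on the spectral norm over the unit Frobenius ball of the subspace. It is an assumption made for illustration and does not reproduce the paper's nonlinear program or its Algorithm 3; the names `orthonormal_basis`, `project`, and `near_rank_one`, the step `gamma`, and the iteration count are hypothetical.

```python
import numpy as np

def orthonormal_basis(matrices):
    """Orthonormal basis (as vectorized columns) of the span of the given d x d matrices."""
    V = np.stack([M.ravel() for M in matrices], axis=1)
    Q, _ = np.linalg.qr(V)
    return Q

def project(Q, M):
    """Orthogonal projection of the d x d matrix M onto the subspace with basis Q."""
    d = M.shape[0]
    return (Q @ (Q.T @ M.ravel())).reshape(d, d)

def near_rank_one(Q, d, gamma=1.0, iters=200, seed=0):
    """Projected ascent: push M towards its dominant eigen-direction, then renormalize."""
    rng = np.random.default_rng(seed)
    M = project(Q, rng.standard_normal((d, d)))
    M /= np.linalg.norm(M)                       # unit Frobenius norm
    for _ in range(iters):
        lam, U = np.linalg.eigh(M)
        k = np.argmax(np.abs(lam))               # eigenvalue of largest magnitude
        u = U[:, k]
        M = project(Q, M + gamma * np.sign(lam[k]) * np.outer(u, u))
        M /= np.linalg.norm(M)
    lam, U = np.linalg.eigh(M)
    return U[:, np.argmax(np.abs(lam))], M

# Toy check with orthonormal weights, where the rank-one elements are known.
rng = np.random.default_rng(3)
d, m = 5, 3
A, _ = np.linalg.qr(rng.standard_normal((d, m)))
Q = orthonormal_basis([np.outer(a, a) for a in A.T])
u, M = near_rank_one(Q, d)
print(np.abs(A.T @ u))   # one entry should be close to 1
```

In this orthonormal toy case every iterate is a combination of the a_i ⊗ a_i, so the iteration simply amplifies the dominant coefficient. Non-orthogonal or perturbed subspaces are the harder case that the paper's guarantees address.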

Figures

Figures reproduced from arXiv: 1907.00485 by Massimo Fornasier, Michael Rauchensteiner, Timo Klock.

**Figure 2.** Error in approximating W for perturbed orthogonal weights and different activation functions.

**Figure 3.** False positive and recovery rates for perturbed orthogonal weights and for different …

**Figure 4.** Error in approximating $\mathcal W$ for weights sampled independently from the unit sphere, and for different activation functions. The reported quantities are: the normalized projection error $\|P_{\hat{\mathcal W}} - P_{\mathcal W}\|_F^2/(m_0+m_1)$; a false positive rate $\mathrm{FP}(T) = \#\{j : E(\hat w_j) > T\}/(m_0+m_1)$, where $T > 0$ is a threshold and $E(u) := \min_{w \in \{\pm a_i, \pm v_\ell \,:\, i \in [m_0],\, \ell \in [m_1]\}} \|u - w\|_2^2$; and recovery rates $R_a(T) = \#\{i : E(a_i) < T\}/m_0$ and $R_v(T) = \#\{\ell : E(v_\ell) < T\}/m_1$, …

**Figure 5.** False positive and recovery rates for weights sampled uniformly at random from the …

**Figure 6.** We illustrate the trajectories $t \to \|\nabla f(t w)\|_2$ for $w \in \{\hat w_j : j \in [m]\}$. The blue trajectories are those for $w \in \{\hat w_j \approx a_i \text{ for some } i\}$ and the red trajectories are those for $w \in \{\hat w_j \approx v_\ell \text{ for some } \ell\}$. We can observe the separation of the trajectories due to the different decay properties.
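The error measures quoted in the Figure 4 caption can be computed along the following lines. This is a sketch under assumptions: the caption is truncated in this page, so the recovery rate is read here as the fraction of true vectors that have a recovered vector within squared distance T up to sign, and all function names are hypothetical. Columns of each array are vectors; `Q_hat` and `Q` hold orthonormal bases of the estimated and true subspaces.

```python
import numpy as np

def sq_dist_up_to_sign(u, W):
    """min over columns w of W of min(||u - w||^2, ||u + w||^2)."""
    minus = np.sum((W - u[:, None]) ** 2, axis=0)
    plus = np.sum((W + u[:, None]) ** 2, axis=0)
    return np.minimum(minus, plus).min()

def false_positive_rate(W_hat, W_true, T):
    """Number of recovered vectors farther than T from every true vector, over m0 + m1."""
    E = np.array([sq_dist_up_to_sign(w, W_true) for w in W_hat.T])
    return np.sum(E > T) / W_true.shape[1]

def recovery_rate(W_target, W_hat, T):
    """Fraction of target vectors matched by some recovered vector (assumed reading)."""
    E = np.array([sq_dist_up_to_sign(w, W_hat) for w in W_target.T])
    return np.mean(E < T)

def normalized_projection_error(Q_hat, Q, m):
    """||P_hat - P||_F^2 / m for orthonormal bases Q_hat, Q of the two subspaces."""
    return np.linalg.norm(Q_hat @ Q_hat.T - Q @ Q.T, "fro") ** 2 / m

# Tiny example: three true unit vectors, one recovered vector is spurious.
W_true = np.eye(4)[:, :3]
W_hat = np.stack([-np.eye(4)[:, 0], np.eye(4)[:, 1], np.eye(4)[:, 3]], axis=1)
print(false_positive_rate(W_hat, W_true, T=0.1))   # one of three recovered vectors is off
print(recovery_rate(W_true, W_hat, T=0.1))         # two of three true vectors are found
```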

Original abstract

We address the structure identification and the uniform approximation of two fully nonlinear layer neural networks of the type $f(x)=1^T h(B^T g(A^T x))$ on $\mathbb R^d$ from a small number of query samples. We approach the problem by sampling actively finite difference approximations to Hessians of the network. Gathering several approximate Hessians allows reliably to approximate the matrix subspace $\mathcal W$ spanned by symmetric tensors $a_1 \otimes a_1 ,\dots,a_{m_0}\otimes a_{m_0}$ formed by weights of the first layer together with the entangled symmetric tensors $v_1 \otimes v_1 ,\dots,v_{m_1}\otimes v_{m_1}$, formed by suitable combinations of the weights of the first and second layer as $v_\ell=A G_0 b_\ell/\|A G_0 b_\ell\|_2$, $\ell \in [m_1]$, for a diagonal matrix $G_0$ depending on the activation functions of the first layer. The identification of the 1-rank symmetric tensors within $\mathcal W$ is then performed by the solution of a robust nonlinear program. We provide guarantees of stable recovery under a posteriori verifiable conditions. We further address the correct attribution of approximate weights to the first or second layer. By using a suitably adapted gradient descent iteration, it is possible then to estimate, up to intrinsic symmetries, the shifts of the activations functions of the first layer and compute exactly the matrix $G_0$. Our method of identification of the weights of the network is fully constructive, with quantifiable sample complexity, and therefore contributes to dwindle the black-box nature of the network training phase. We corroborate our theoretical results by extensive numerical experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a constructive way to recover the weights of two-hidden-layer networks: it samples approximate Hessians to span the right subspace, then extracts the rank-1 factors with a nonlinear program.


The core contribution is a pipeline that samples finite-difference Hessians to approximate the subspace spanned by the first-layer outer products plus the entangled second-layer terms, then solves a robust nonlinear program to pull out the individual rank-1 tensors and attributes them to the correct layer. The authors also estimate the first-layer activation shifts via a gradient-descent step, which in turn gives G_0 exactly. This extends single-layer results to exactly two layers in a fully constructive manner, with stated sample complexity and a posteriori verifiable recovery conditions, and the abstract reports numerical experiments in support. That combination of active sampling, subspace recovery, and layer attribution is the actual advance.

The approach is honest about depending on the finite-difference step being accurate enough to capture W; if the approximation drifts, the downstream nonlinear program has nothing useful to work with. The paper claims this step is reliable under conditions that are verified after the fact, and the experiments are said to corroborate it, so the central claim appears to hold in the tested regimes. A minor practical question is how sensitive the finite-difference accuracy is to the step-size choice or to activation curvature outside the clean synthetic cases.

This is for readers working on network identifiability and query-based recovery in theoretical ML. It deserves peer review because the two-layer extension is new, the method is explicit, and there is both theory and numerics to assess.

Referee Report

2 major / 2 minor

Summary. The paper claims a constructive method to identify the weights of, and uniformly approximate, two-hidden-layer networks f(x)=1^T h(B^T g(A^T x)) from few queries. Active finite-difference Hessian sampling recovers the subspace W spanned by the rank-1 tensors a_i ⊗ a_i and the entangled tensors v_ℓ ⊗ v_ℓ (with v_ℓ = A G_0 b_ℓ / ||A G_0 b_ℓ||_2). A robust nonlinear program isolates the tensors, a posteriori verifiable conditions yield stable recovery guarantees, and layer attribution is resolved. A suitably adapted gradient descent then estimates the first-layer shifts up to intrinsic symmetries, after which G_0 is computed exactly. The approach is fully constructive with quantifiable sample complexity and is supported by numerical experiments.

Significance. If the recovery guarantees hold, the work supplies an explicit, query-based identification procedure that quantifies sample needs and reduces the black-box character of network training; the constructive nature and numerical corroboration are explicit strengths.

major comments (2)

Abstract (sampling Hessians and subspace W paragraph): the central claim that 'gathering several approximate Hessians allows reliably to approximate the matrix subspace W' is load-bearing for all downstream steps (nonlinear program, layer attribution, shift estimation). No explicit finite-difference error bounds, sampling-density requirements, or conditioning assumptions on G_0 appear, so it is unclear under what verifiable conditions the estimated W remains close enough to the true span for the subsequent rank-1 identification to succeed.
Abstract (guarantees paragraph): the statement 'We provide guarantees of stable recovery under a posteriori verifiable conditions' is the main theoretical contribution, yet the abstract supplies neither the form of these conditions nor how they are checked after the nonlinear program; without this, the claim that the pipeline yields stable recovery cannot be assessed.

minor comments (2)

The phrase '1-rank symmetric tensors' should be replaced by the standard 'rank-1' throughout for clarity.
Dimension symbols m_0, m_1, d and the precise meaning of the diagonal matrix G_0 are introduced only implicitly; an early explicit statement would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need for greater clarity in the abstract regarding the theoretical foundations. We address each major comment below and will revise the abstract accordingly to better convey the error bounds, sampling requirements, and verification procedures while preserving its concise nature. The full details remain in the body of the manuscript.


Referee: Abstract (sampling Hessians and subspace W paragraph): the central claim that 'gathering several approximate Hessians allows reliably to approximate the matrix subspace W' is load-bearing for all downstream steps (nonlinear program, layer attribution, shift estimation). No explicit finite-difference error bounds, sampling-density requirements, or conditioning assumptions on G_0 appear, so it is unclear under what verifiable conditions the estimated W remains close enough to the true span for the subsequent rank-1 identification to succeed.

Authors: The finite-difference error bounds, sampling-density requirements, and dependence on the conditioning of G_0 are derived explicitly in Section 3. Theorem 3.1 bounds the perturbation of each approximate Hessian, while Theorem 3.2 controls the resulting subspace distance in terms of the number of samples, the minimal singular value of G_0, and the activation Lipschitz constants. These quantities are a posteriori verifiable by inspecting the numerical rank and conditioning of the collected Hessian matrices. We will revise the abstract to reference these results and state the key assumptions on G_0.

revision: yes

Referee: Abstract (guarantees paragraph): the statement 'We provide guarantees of stable recovery under a posteriori verifiable conditions' is the main theoretical contribution, yet the abstract supplies neither the form of these conditions nor how they are checked after the nonlinear program; without this, the claim that the pipeline yields stable recovery cannot be assessed.

Authors: The form of the conditions and the post-nonlinear-program verification procedure are stated in Theorem 4.2 and the accompanying Algorithm 1. After recovery, one checks the residual norm of the rank-1 decomposition against the subspace error bound and verifies a minimum angular separation between the recovered tensors; both checks use only the sampled Hessians and the program output. We agree the abstract is too terse on this point and will update it to summarize the verification steps.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's identification pipeline begins from external query samples, computes finite-difference Hessian approximations, recovers subspace W, solves a robust nonlinear program for rank-1 tensors, performs layer attribution, and applies gradient descent to recover shifts and G_0. All steps are presented as constructive with quantifiable sample complexity and a-posteriori verifiable recovery conditions. No step reduces by construction to a fitted input, self-definition, or unverified self-citation chain; the central claims remain independent of the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the network exactly matching the stated functional form and on the existence of verifiable conditions that certify subspace recovery; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: The network is exactly of the form f(x) = 1^T h(B^T g(A^T x)) on R^d.
Stated as the problem setup in the first sentence of the abstract.
domain assumption: Finite-difference approximations to Hessians are accurate enough to recover the subspace W spanned by the indicated symmetric tensors.
Invoked when the abstract says 'sampling actively finite difference approximations to Hessians ... allows reliably to approximate the matrix subspace W'.



[1] [1]

Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-$1$ Updates

A. Anandkumar, R. Ge, and M. Janzamin, Guaranteed non-orthogonal tensor decomposition via alternating rank- 1 updates, arXiv:1402.5180, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[2] [2]

Anthony and P

M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations . Cambridge University Press, Cambridge, 1999

work page 1999

[3] [3]

Bach, Breaking the curse of dimensionality with convex neural networks ,

F. Bach, Breaking the curse of dimensionality with convex neural networks ,

work page

[4] [4]

R. Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 1997

work page 1997

[5] [5]

J. J. Benedetto and M. Fickus, Finite normalized tight frames , Advances in Computational Mathematics, Vol. 18, No. 24, pp 357385, 2003 J. Mach. Learn. Res. 18 (2017), 1–53

work page 2003

[6] [6]

Bengio, P

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, Advances in Neural Information Processing Systems 19 (NIPS 2006)

work page 2006

[7] [7]

A. L. Blum and R. L. Rivest, Training a 3-node neural network is NP-complete. Neural Networks 5 (1) (1992), 117–127

work page 1992

[8] [8]

T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, High-performance OCR for printed English and Fraktur using LSTM networks , In: 12th International Conference on Document Analysis and Recognition (2013), 683–687

work page 2013

[9] [9]

Bruna and S

J. Bruna and S. Mallat. Invariant scattering convolution networks . IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):18721886, 2013

work page 2013

[10] [10]

Carlini and D

N. Carlini and D. Wagner, Towards evaluating the robustness of neural networks , In: 2017 IEEE Symposium on Security and Privacy (SP) (2017), pp. 39–57

work page 2017

[11] P. G. Casazza and N. Leonhard, Classes of finite equal norm Parseval frames, Contemporary Mathematics, 451, 2008

[12] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, Multi-column deep neural network for traffic sign classification, Neural Networks 32 (2012), 333–338

[13] A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, and D. Picard. Capturing ridge functions in high dimensions from point queries. Constructive Approximation, 35(2):225–243, Apr 2012

[14] P. Constantine, Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies, SIAM Spotlights 2, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 2015

[15] P. Constantine, E. Dow, and Q. Wang, Active subspaces in theory and practice: Applications to kriging surfaces, SIAM J. Sci. Comput. 36 (2014), pp. A1500–A1524

[16] V. De Silva and L.-H. Lim, Tensor rank and the ill-posedness of the best low-rank approximation problem, SIAM J. Matrix Anal. Appl. 30 (3) (2008), 1084–1127

[17] R. DeVore, K. Oskolkov, and P. Petrushev, Approximation of feed-forward neural networks, Ann. Numer. Math. 4 (1997), 261–287

[18] L. Devroye and L. Györfi, Nonparametric Density Estimation, Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics, John Wiley & Sons Inc., New York, 1985

[19] C. Fefferman, Reconstructing a neural net from its output, Rev. Mat. Iberoam. 10 (3) (1994), 507–555

[20] M. Fornasier, K. Schnass, and J. Vybíral. Learning functions of few arbitrary linear parameters in high dimensions. Found. Comput. Math., 12(2):229–262, April 2012

[21] M. Fornasier, J. Vybíral, and I. Daubechies. Robust and resource efficient identification of shallow neural networks by fewest samples. arXiv:1804.01592v2, https://arxiv.org/pdf/1804.01592.pdf, 2019

[22] C. Fiedler, M. Fornasier, T. Klock, and M. Rauchensteiner, Robust and resource efficient identification of deep neural networks, in preparation

[23] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser, 2013

[24] A. Gittens and J. A. Tropp. Tail bounds for all eigenvalues of a sum of random matrices. arXiv:1104.4513, Apr 2011

[25] A. Graves, A.-R. Mohamed, and G. E. Hinton, Speech recognition with deep recurrent neural networks, In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013), 6645–6649

[26] N. Golowich, A. Rakhlin, and O. Shamir, Size-independent sample complexity of neural networks, Proceedings of the 31st Conference On Learning Theory, 85, 297–299, 2018

[27] P. Grohs, D. Perekrestenko, D. Elbraechter, and H. Boelcskei, Deep neural network approximation theory, arXiv:1901.02220

[28] J. Håstad, Tensor rank is NP-complete, J. Algorithms 11 (4) (1990), 644–654

[29] Ch. J. Hillar and L.-H. Lim, Most tensor problems are NP-hard, J. ACM 60 (6) (2013), 1–45

[30] M. Hristache, A. Juditsky, and V. Spokoiny. Direct estimation of the index coefficient in a single-index model. Annals of Statistics, pages 595–623, 2001

[31] H. Ichimura. Semiparametric least squares (sls) and weighted sls estimation of single-index models. Journal of Econometrics, 58(1-2):71–120, 1993

[32] M. Janzamin, H. Sedghi, and A. Anandkumar, Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv:1506.08473, Jun 2015

[33] J. S. Judd, Neural network design and the complexity of learning, MIT Press, 1990

[34] K. Kawaguchi, Deep learning without poor local minima, Advances in Neural Information Processing Systems (NIPS 2016)

[35] T. G. Kolda, Symmetric orthogonal tensor decomposition is trivial, arXiv:1503.01375, 2015

[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, In: Advances in Neural Information Processing Systems (NIPS) (2012), 1–9

[37] K. Li, On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s Lemma, J. Am. Stat. Assoc. 87 (420) (1992), 1025–1039

[38] X. Li, Interpolation by ridge polynomials and its application in neural networks, J. Comput. Appl. Math. 144 (1-2) (2002), 197–209

[39] W. Light, Ridge functions, sigmoidal functions and neural networks, Approximation theory VII, Proc. 7th Int. Symp., Austin/TX (USA) 1992, 163–206 (1993)

[40] J. R. Magnus. On differentiating eigenvalues and eigenvectors. Econometric Theory, 1(2):179–191, 1985

[41] S. Mayer, T. Ullrich, and J. Vybíral, Entropy and sampling numbers of classes of ridge functions, Constr. Appr. 42 (2) (2015), 231–264

[42] S. Mei, T. Misiakiewicz, and A. Montanari, Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit, arXiv:1902.06015

[43] M. Mondelli and A. Montanari, On the connection between learning two-layers neural networks and tensor decomposition. CoRR, abs/1802.07301, 2018

[44] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling, Deepstack: Expert-level artificial intelligence in heads-up no-limit poker, Science 356, no. 6337 (2017), 508–513

[45] Y. Nakatsukasa, T. Soma, and A. Uschmajew, Finding a low-rank basis in a matrix subspace. Mathematical Programming, 162(1-2):325–361, 2017

[46] P. P. Petrushev, Approximation by ridge functions and neural networks, SIAM J. Math. Anal. 30 (1) (1999), 155–189

[47] A. Pinkus, Approximating by ridge functions. Le Méhauté, Alain (ed.) et al., Surface fitting and multiresolution methods. Vol. 2 of the proceedings of the 3rd international conference on Curves and surfaces, held in Chamonix-Mont-Blanc, France, June 27-July 3, 1996. Nashville, TN: Vanderbilt University Press. 279–292 (1997)

[48] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica, Vol. 8, 143–195, 1999

[49] Q. Qu, J. Sun, and J. Wright, Finding a sparse vector in a subspace: Linear sparsity using alternating directions, IEEE Trans. Inform. Theory 62 (10) (2016), 5855–5880

[50] F. Rellich and J. Berkowitz. Perturbation theory of eigenvalue problems. CRC Press, 1969

[51] E. Robeva, Orthogonal decomposition of symmetric tensors, arXiv:1409.6685, 2014

[52] G. M. Rotskoff and E. Vanden-Eijnden, Neural networks as interacting particle systems: asymptotic convexity of the loss landscape and universal scaling of the approximation error, arXiv:1805.00915, 2018

[53] M. Rudelson and R. Vershynin, Sampling from large matrices: An approach through geometric functional analysis, J. ACM 54 (4) (2007), Art. 21, 19 pp.

[54] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural networks. CoRR, abs/1509.07385, 2015

[55] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014

[56] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser et al., Mastering the game of Go with deep neural networks and tree search, Nature 529, no. 7587 (2016), 484–489

[57] D. Soudry and Y. Carmon, No bad local minima: Data independent training error guarantees for multilayer neural networks, arXiv:1605.08361

[58] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition, Neural Networks 32 (2012), 323–332

[59] C. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Stat. 9 (1981), 1135–1151

[60] G. W. Stewart, Perturbation theory for the singular value decomposition, in SVD and Signal Processing, II, ed. R. J. Vacarro, Elsevier, 1991

[61] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller, Interpretable deep neural networks for single-trial EEG classification, J. Neuroscience Methods 274 (2016), 141–145

[62] T. Tao, Topics in random matrix theory, Vol. 132, American Mathematical Soc., 2012

[63] T. Tao, When are eigenvalues stable?, What’s new, Blog entry 28 October, 2008, https://terrytao.wordpress.com/2008/10/28/when-are-eigenvalues-stable/

[64] J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004

[65] A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics

[66] R. Vershynin. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018. An introduction with applications in data science, With a foreword by Sara van de Geer

[67] P.-A. Wedin, Perturbation bounds in connection with singular value decomposition, BIT 12 (1972), 99–111

[68] T. Wiatowski, P. Grohs, and H. Boelcskei. Energy propagation in deep convolutional neural networks. IEEE Transactions on Information Theory, PP(99):11, 2018