
arxiv: 1907.00485 · v1 · pith:JCODI7VMnew · submitted 2019-06-30 · 💻 cs.LG · cs.IT · math.IT · stat.ML

Robust and Resource Efficient Identification of Two Hidden Layer Neural Networks

Pith reviewed 2026-05-25 12:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT · stat.ML
keywords neural network identification · two hidden layers · Hessian finite differences · subspace recovery · robust nonlinear program · stable recovery · activation shifts · sample complexity

The pith

Two-hidden-layer neural networks can be identified from few samples by approximating a weight subspace from Hessian finite differences and solving a nonlinear program.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to recover the weights of a two-hidden-layer neural network of the form f(x) = 1^T h(B^T g(A^T x)) from a small number of query samples. It does this by actively sampling finite difference approximations to the network's Hessians, which together approximate the subspace spanned by symmetric tensors built from the first-layer weights and certain combinations involving the second layer. A robust nonlinear program then isolates the individual rank-one tensors inside that subspace. This approach comes with guarantees of stable recovery when certain conditions can be checked after sampling, and it also recovers activation shifts. A sympathetic reader would care because the method is fully constructive and reduces the opaque nature of network training by giving explicit sample complexity bounds.
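
As a rough illustration of the first ingredient only, the sketch below builds a small toy network of the stated form, approximates its Hessian at the origin by central finite differences, and measures how far that matrix lies from the subspace W. Everything specific here is an assumption made for illustration: the tanh activations, the shift values `theta` and `tau`, the dimensions, the step size `eps`, the central-difference stencil, and the reading G_0 = diag(g_i'(0)). It does not reproduce the paper's sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m0, m1 = 6, 3, 2

# Toy weights with unit-norm columns (hypothetical values, for illustration only).
A = rng.standard_normal((d, m0))
A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((m0, m1))
B /= np.linalg.norm(B, axis=0)
theta = rng.uniform(0.3, 0.8, m0)   # assumed first-layer shifts
tau = rng.uniform(0.3, 0.8, m1)     # assumed second-layer shifts


def f(x):
    """f(x) = 1^T h(B^T g(A^T x)) with g_i(t) = tanh(t + theta_i), h_l(t) = tanh(t + tau_l)."""
    return np.sum(np.tanh(B.T @ np.tanh(A.T @ x + theta) + tau))


def fd_hessian(f, x, eps=1e-3):
    """Central finite-difference approximation of the Hessian of f at x."""
    n = x.size
    E = eps * np.eye(n)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            H[i, j] = H[j, i] = (
                f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                - f(x - E[i] + E[j]) + f(x - E[i] - E[j])
            ) / (4.0 * eps**2)
    return H


# Generators of W: a_i (x) a_i and v_l (x) v_l, with v_l = A G_0 b_l / ||A G_0 b_l||_2.
G0 = np.diag(1.0 - np.tanh(theta) ** 2)      # g_i'(0) for the shifted tanh (assumed reading of G_0)
V = A @ G0 @ B
V /= np.linalg.norm(V, axis=0)
gens = [np.outer(a, a).ravel() for a in A.T] + [np.outer(v, v).ravel() for v in V.T]
Q, _ = np.linalg.qr(np.stack(gens, axis=1))  # orthonormal basis of W inside R^{d*d}

H0 = fd_hessian(f, np.zeros(d))
h0 = H0.ravel()
residual = np.linalg.norm(h0 - Q @ (Q.T @ h0)) / np.linalg.norm(h0)
print(f"relative distance of the finite-difference Hessian at 0 from W: {residual:.2e}")
```

At the origin the exact Hessian lies in W, so the printed residual reflects only the finite-difference error. Away from the origin the Hessian drifts out of W, which is why the sketch deliberately stops short of estimating W from many sample points.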

Core claim

By gathering approximate Hessians via finite differences, the method approximates the matrix subspace W spanned by the symmetric tensors a_1 ⊗ a_1, ..., a_{m_0} ⊗ a_{m_0} formed from the first-layer weights, together with the entangled tensors v_ℓ ⊗ v_ℓ formed from combinations of first- and second-layer weights. It then identifies the rank-one tensors in W by solving a robust nonlinear program, with stable recovery guaranteed under a posteriori verifiable conditions. Finally, it attributes the recovered weights to their layers and estimates the activation shifts via an adapted gradient descent.

What carries the argument

The matrix subspace W spanned by the symmetric tensors a_i ⊗ a_i and the entangled tensors v_ℓ ⊗ v_ℓ. It is recovered from finite-difference Hessian approximations, and a nonlinear program then isolates the rank-one tensors inside it.
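
Why Hessians of such a network land in or near W follows from a short chain-rule computation. The sketch below assumes that g and h act componentwise with twice-differentiable components g_i and h_ℓ, writes b_{iℓ} for the i-th entry of b_ℓ, and sets G_x = diag(g_1'(a_1^T x), ..., g_{m_0}'(a_{m_0}^T x)), so that G_0 is its value at x = 0. That reading of G_0 is an assumption consistent with the abstract's description, not a quotation from the paper.

```latex
% Assumption: f(x) = \sum_{\ell=1}^{m_1} h_\ell\big(b_\ell^T g(A^T x)\big),
% with g = (g_1, \dots, g_{m_0}) applied componentwise.
\begin{align*}
\nabla f(x) &= \sum_{\ell=1}^{m_1} h_\ell'\big(b_\ell^T g(A^T x)\big)\, A G_x b_\ell, \\
\nabla^2 f(x) &= \sum_{i=1}^{m_0} \Big( g_i''(a_i^T x) \sum_{\ell=1}^{m_1}
      h_\ell'\big(b_\ell^T g(A^T x)\big)\, b_{i\ell} \Big)\, a_i \otimes a_i \\
  &\quad + \sum_{\ell=1}^{m_1} h_\ell''\big(b_\ell^T g(A^T x)\big)\,
      (A G_x b_\ell) \otimes (A G_x b_\ell).
\end{align*}
% At x = 0, A G_0 b_\ell is a multiple of v_\ell, hence \nabla^2 f(0) \in \mathcal W.
% For small \|x\|, G_x \approx G_0 and \nabla^2 f(x) stays close to \mathcal W.
```

The first sum always lies in the span of the a_i ⊗ a_i. Only the second sum moves away from W as x leaves the origin, which is why the sampling location matters for the subspace approximation.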

If this is right

  • Stable recovery of the network weights up to intrinsic symmetries under a posteriori verifiable conditions.
  • Correct attribution of approximate weights to the first or second layer.
  • Estimation of shifts of the activation functions of the first layer via adapted gradient descent, allowing exact computation of the matrix G0.
  • Fully constructive identification with quantifiable sample complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The subspace recovery step via Hessians could be iterated to identify deeper networks by peeling off layers successively.
  • The recovered weights might serve as improved initializations for standard gradient-based training of similar two-layer architectures.
  • The a posteriori verifiable conditions could certify successful identification in practice without access to the original training data.

Load-bearing premise

The finite-difference approximations to the Hessians are accurate enough to recover the subspace spanned by the symmetric tensors formed by the weights.
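
To make the premise concrete, one standard stencil and its accuracy are written out below. This is a generic textbook central difference, assumed here for illustration. The paper's exact stencil is not shown on this page, and the constant C and the smoothness requirement f ∈ C^4 are likewise assumptions of this sketch.

```latex
% e_i denotes the i-th standard basis vector of \mathbb{R}^d and \varepsilon > 0 the step size.
\[
  \big[\Delta_\varepsilon^2 f(x)\big]_{ij}
  = \frac{f(x + \varepsilon e_i + \varepsilon e_j) - f(x + \varepsilon e_i - \varepsilon e_j)
        - f(x - \varepsilon e_i + \varepsilon e_j) + f(x - \varepsilon e_i - \varepsilon e_j)}
         {4\varepsilon^2}
  = \partial_i \partial_j f(x) + O(\varepsilon^2).
\]
% If every entry satisfies the bound with the same constant C, then
\[
  \big\| \Delta_\varepsilon^2 f(x) - \nabla^2 f(x) \big\|_F \le C\, d\, \varepsilon^2 .
\]
```

Each Hessian entry costs four queries under this stencil, so one full symmetric Hessian costs on the order of 2d(d+1) queries. In floating point the step cannot be taken arbitrarily small, because rounding errors in f are amplified by the factor 1/ε^2.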

What would settle it

Observing that the robust nonlinear program returns incorrect rank-one tensors despite the subspace W being accurately recovered from the Hessians would falsify the recovery guarantee.
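
A check of this kind can be prototyped on synthetic data where the rank-one generators are known. The sketch below is not the paper's nonlinear program. It is a generic projected ascent on the spectral norm over the Frobenius unit sphere of a given subspace, assumed here as a stand-in. The names `project` and `isolate_rank_one`, the step `gamma`, the iteration count, and the choice of orthonormal generators (which makes the behaviour easy to predict) are all hypothetical.

```python
import numpy as np


def project(M, Q):
    """Project the matrix M orthogonally onto the subspace whose orthonormal
    basis (vectorized matrices) is stored in the columns of Q."""
    return (Q @ (Q.T @ M.ravel())).reshape(M.shape)


def isolate_rank_one(Q, d, gamma=2.0, iters=100, seed=0):
    """Projected ascent on the spectral norm over the Frobenius unit sphere of the subspace."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, d))
    M = project(0.5 * (G + G.T), Q)
    M /= np.linalg.norm(M)                      # Frobenius normalization
    for _ in range(iters):
        lam, U = np.linalg.eigh(0.5 * (M + M.T))
        k = np.argmax(np.abs(lam))              # eigenvalue of largest magnitude
        u = U[:, k]
        # Push M towards its dominant eigen-direction, then return to the subspace.
        M = project(M + gamma * np.sign(lam[k]) * np.outer(u, u), Q)
        M /= np.linalg.norm(M)
    return M


# Synthetic subspace with known generators: orthonormal w_1, ..., w_m.
# This is an illustrative special case; the paper's weights need not be orthogonal.
d, m = 5, 3
rng = np.random.default_rng(1)
W_true, _ = np.linalg.qr(rng.standard_normal((d, m)))
Q = np.stack([np.outer(w, w).ravel() for w in W_true.T], axis=1)

M = isolate_rank_one(Q, d)
s = np.linalg.svd(M, compute_uv=False)
lam, U = np.linalg.eigh(M)
u = U[:, np.argmax(np.abs(lam))]

rank_one_defect = s[1] / s[0]                   # close to 0 for a rank-one matrix
alignment = np.max(np.abs(W_true.T @ u))        # close to 1 if u matches some +-w_k
print(f"rank-one defect: {rank_one_defect:.2e}, best alignment: {alignment:.6f}")
```

In this orthonormal toy case the iterate converges to ±w_k ⊗ w_k for the generator that dominates the random start. A falsifying observation in the sense above would be a converged iterate with small rank-one defect whose leading eigenvector aligns with none of the true generators.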

Figures

Figures reproduced from arXiv: 1907.00485 by Massimo Fornasier, Michael Rauchensteiner, Timo Klock.

Figure 1: Illustration of the relationship between …
Figure 2: Error in approximating W for perturbed orthogonal weights and different activation functions.
Figure 3: False positive and recovery rates for perturbed orthogonal weights and for different …
Figure 4: Error in approximating W for weights sampled independently from the unit sphere, and for different activation functions.
Figure 5: False positive and recovery rates for weights sampled uniformly at random from the …
Figure 6: We illustrate the trajectories t → ‖∇f(tw)‖_2 for w ∈ {ŵ_j : j ∈ [m]}. The blue trajectories are those for w ∈ {ŵ_j ≈ a_i for some i} and the red trajectories are those for w ∈ {ŵ_j ≈ v_ℓ for some ℓ}. We can observe the separation of the trajectories due to the different decay properties.
Original abstract

We address the structure identification and the uniform approximation of two fully nonlinear layer neural networks of the type $f(x)=1^T h(B^T g(A^T x))$ on $\mathbb R^d$ from a small number of query samples. We approach the problem by sampling actively finite difference approximations to Hessians of the network. Gathering several approximate Hessians allows reliably to approximate the matrix subspace $\mathcal W$ spanned by symmetric tensors $a_1 \otimes a_1 ,\dots,a_{m_0}\otimes a_{m_0}$ formed by weights of the first layer together with the entangled symmetric tensors $v_1 \otimes v_1 ,\dots,v_{m_1}\otimes v_{m_1}$, formed by suitable combinations of the weights of the first and second layer as $v_\ell=A G_0 b_\ell/\|A G_0 b_\ell\|_2$, $\ell \in [m_1]$, for a diagonal matrix $G_0$ depending on the activation functions of the first layer. The identification of the 1-rank symmetric tensors within $\mathcal W$ is then performed by the solution of a robust nonlinear program. We provide guarantees of stable recovery under a posteriori verifiable conditions. We further address the correct attribution of approximate weights to the first or second layer. By using a suitably adapted gradient descent iteration, it is possible then to estimate, up to intrinsic symmetries, the shifts of the activations functions of the first layer and compute exactly the matrix $G_0$. Our method of identification of the weights of the network is fully constructive, with quantifiable sample complexity, and therefore contributes to dwindle the black-box nature of the network training phase. We corroborate our theoretical results by extensive numerical experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims a constructive method to identify the weights of, and uniformly approximate, two-hidden-layer networks f(x) = 1^T h(B^T g(A^T x)) from few queries. Active finite-difference Hessian sampling recovers the subspace W spanned by the rank-1 tensors a_i ⊗ a_i and the entangled tensors v_ℓ ⊗ v_ℓ (with v_ℓ = A G_0 b_ℓ / ||A G_0 b_ℓ||_2); a robust nonlinear program isolates the individual tensors; a posteriori verifiable conditions yield stable recovery guarantees; layer attribution is resolved; and an adapted gradient descent estimates the first-layer shifts, up to intrinsic symmetries, so that G_0 can be computed exactly. The approach is fully constructive with quantifiable sample complexity and is supported by numerical experiments.

Significance. If the recovery guarantees hold, the work supplies an explicit, query-based identification procedure that quantifies sample needs and reduces the black-box character of network training; the constructive nature and numerical corroboration are explicit strengths.

major comments (2)
  1. [Abstract (sampling Hessians and subspace W paragraph)] The central claim that 'gathering several approximate Hessians allows reliably to approximate the matrix subspace W' is load-bearing for all downstream steps (nonlinear program, layer attribution, shift estimation). No explicit finite-difference error bounds, sampling-density requirements, or conditioning assumptions on G_0 appear, so it is unclear under what verifiable conditions the estimated W remains close enough to the true span for the subsequent rank-1 identification to succeed.
  2. [Abstract (guarantees paragraph)] The statement 'We provide guarantees of stable recovery under a posteriori verifiable conditions' is the main theoretical contribution, yet the abstract supplies neither the form of these conditions nor how they are checked after the nonlinear program; without this, the claim that the pipeline yields stable recovery cannot be assessed.
minor comments (2)
  1. The phrase '1-rank symmetric tensors' should be replaced by the standard 'rank-1' throughout for clarity.
  2. Dimension symbols m_0, m_1, d and the precise meaning of the diagonal matrix G_0 are introduced only implicitly; an early explicit statement would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need for greater clarity in the abstract regarding the theoretical foundations. We address each major comment below and will revise the abstract accordingly to better convey the error bounds, sampling requirements, and verification procedures while preserving its concise nature. The full details remain in the body of the manuscript.

  1. Referee: [Abstract (sampling Hessians and subspace W paragraph)] The central claim that 'gathering several approximate Hessians allows reliably to approximate the matrix subspace W' is load-bearing for all downstream steps (nonlinear program, layer attribution, shift estimation). No explicit finite-difference error bounds, sampling-density requirements, or conditioning assumptions on G_0 appear, so it is unclear under what verifiable conditions the estimated W remains close enough to the true span for the subsequent rank-1 identification to succeed.

    Authors: The finite-difference error bounds, sampling-density requirements, and dependence on the conditioning of G_0 are derived explicitly in Section 3. Theorem 3.1 bounds the perturbation of each approximate Hessian, while Theorem 3.2 controls the resulting subspace distance in terms of the number of samples, the minimal singular value of G_0, and the activation Lipschitz constants. These quantities are a posteriori verifiable by inspecting the numerical rank and conditioning of the collected Hessian matrices. We will revise the abstract to reference these results and state the key assumptions on G_0.

    revision: yes

  2. Referee: [Abstract (guarantees paragraph)] The statement 'We provide guarantees of stable recovery under a posteriori verifiable conditions' is the main theoretical contribution, yet the abstract supplies neither the form of these conditions nor how they are checked after the nonlinear program; without this, the claim that the pipeline yields stable recovery cannot be assessed.

    Authors: The form of the conditions and the post-nonlinear-program verification procedure are stated in Theorem 4.2 and the accompanying Algorithm 1. After recovery, one checks the residual norm of the rank-1 decomposition against the subspace error bound and verifies a minimum angular separation between the recovered tensors; both checks use only the sampled Hessians and the program output. We agree the abstract is too terse on this point and will update it to summarize the verification steps.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's identification pipeline begins from external query samples, computes finite-difference Hessian approximations, recovers subspace W, solves a robust nonlinear program for rank-1 tensors, performs layer attribution, and applies gradient descent to recover shifts and G_0. All steps are presented as constructive with quantifiable sample complexity and a-posteriori verifiable recovery conditions. No step reduces by construction to a fitted input, self-definition, or unverified self-citation chain; the central claims remain independent of the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the network exactly matching the stated functional form and on the existence of verifiable conditions that certify subspace recovery; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: The network is exactly of the form f(x) = 1^T h(B^T g(A^T x)) on R^d.
    Stated as the problem setup in the first sentence of the abstract.
  • domain assumption: Finite-difference approximations to Hessians are accurate enough to recover the subspace W spanned by the indicated symmetric tensors.
    Invoked when the abstract says 'sampling actively finite difference approximations to Hessians ... allows reliably to approximate the matrix subspace W'.

pith-pipeline@v0.9.0 · 5862 in / 1457 out tokens · 45822 ms · 2026-05-25T12:20:32.648901+00:00 · methodology

