Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian

Samet Oymak, Zalan Fabian, Mingchen Li, Mahdi Soltanolkotabi · 2019 · cs.LG · arXiv 1906.05392

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

Modern neural network architectures often generalize well despite containing many more parameters than the size of the training dataset. This paper explores the generalization capabilities of neural networks trained via gradient descent. We develop a data-dependent optimization and generalization theory which leverages the low-rank structure of the Jacobian matrix associated with the network. Our results help demystify why training and generalization are easier on clean and structured datasets and harder on noisy and unstructured datasets, as well as how the network size affects the evolution of the train and test errors during training. Specifically, we use a control knob to split the Jacobian spectrum into "information" and "nuisance" spaces associated with the large and small singular values. We show that over the information space learning is fast and one can quickly train a model with zero training loss that can also generalize well. Over the nuisance space training is slower and early stopping can help with generalization at the expense of some bias. We also show that the overall generalization capability of the network is controlled by how well the label vector is aligned with the information space. A key feature of our results is that even constant-width neural nets can provably generalize for sufficiently nice datasets. We conduct various numerical experiments on deep networks that corroborate our theoretical findings and demonstrate that: (i) the Jacobian of typical neural networks exhibits low-rank structure with a few large singular values and many small ones, leading to a low-dimensional information space, (ii) over the information space learning is fast and most of the label vector falls on this space, and (iii) label noise falls on the nuisance space and impedes optimization/generalization.
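The information/nuisance split described in the abstract can be illustrated with a small numerical sketch. The code below is not the paper's experimental setup: it assumes a toy one-hidden-layer ReLU network, synthetic two-cluster data, and a cutoff expressed as a fraction of the largest singular value. All function names and parameters (`jacobian_one_hidden_layer`, `split_spectrum`, `energy_fraction`, `rel_cutoff`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def jacobian_one_hidden_layer(X, W, v):
    """Jacobian of f(x) = v^T relu(W x) with respect to vec(W), one row per sample."""
    act = (X @ W.T > 0).astype(X.dtype)            # (n, k): ReLU derivative per hidden unit
    rows = (act * v)[:, :, None] * X[:, None, :]   # (n, k, d): (v * act_i) outer x_i
    return rows.reshape(X.shape[0], -1)            # (n, k*d)

def split_spectrum(J, rel_cutoff):
    """Split the left singular vectors of J at the cutoff rel_cutoff * (largest singular value)."""
    evals, U = np.linalg.eigh(J @ J.T)             # eigen-decomposition of the n x n Gram matrix
    s = np.sqrt(np.clip(evals, 0.0, None))         # singular values of J
    info = s >= rel_cutoff * s.max()               # large singular values: "information" space
    return U[:, info], U[:, ~info], s              # remaining directions: "nuisance" space

def energy_fraction(y, U_sub):
    """Fraction of the squared norm of y lying in the span of the columns of U_sub."""
    coeffs = U_sub.T @ y
    return float(coeffs @ coeffs / (y @ y))

# Assumed toy data: two tight clusters with labels +1 / -1.
rng = np.random.default_rng(0)
n, d, k = 40, 5, 16
centers = rng.normal(size=(2, d))
cluster = rng.integers(0, 2, size=n)
X = centers[cluster] + 0.05 * rng.normal(size=(n, d))
y = 2.0 * cluster - 1.0

# Random first-layer weights and fixed output weights (an assumption of this sketch).
W = rng.normal(size=(k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

J = jacobian_one_hidden_layer(X, W, v)
U_info, U_nuis, s = split_spectrum(J, rel_cutoff=0.1)
info_energy = energy_fraction(y, U_info)
nuis_energy = energy_fraction(y, U_nuis)
print(f"information-space dimension: {U_info.shape[1]} of {n}")
print(f"label energy on information space: {info_energy:.3f}")
print(f"label energy on nuisance space:    {nuis_energy:.3f}")
```

Here `rel_cutoff` plays the role of the control knob: raising it shrinks the information space, lowering it enlarges it. Rerunning the sketch with randomly flipped labels is one way to probe how label noise shifts energy toward the nuisance space.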

representative citing papers

Pointwise Generalization in Deep Neural Networks

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.

On the Convergence Rate of LoRA Gradient Descent

cs.LG · 2025-12-20 · unverdicted · novelty 7.0

LoRA gradient descent converges to a stationary point at rate O(1/log T).

LoRA: Low-Rank Adaptation of Large Language Models

cs.CL · 2021-06-17 · accept · novelty 7.0

Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

cs.LG · 2026-02-17 · unverdicted · novelty 6.0

CrispEdit edits LLMs via low-curvature projections using Bregman divergence and K-FAC approximations, achieving high edit success with under 1% average capability degradation.

citing papers explorer

Showing 4 of 4 citing papers.

Pointwise Generalization in Deep Neural Networks cs.LG · 2026-05-18 · unverdicted · none · ref 100 · internal anchor
On the Convergence Rate of LoRA Gradient Descent cs.LG · 2025-12-20 · unverdicted · none · ref 7 · internal anchor
LoRA: Low-Rank Adaptation of Large Language Models cs.CL · 2021-06-17 · accept · none · ref 41
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing cs.LG · 2026-02-17 · unverdicted · none · ref 18 · internal anchor
