Do Deep Nets Really Need to be Deep?

2013 · cs.LG · arXiv 1312.6184

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on the TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.
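
The abstract does not spell out the training procedure. One plausible reading of "training shallow neural nets to mimic deeper models" is that the shallow net is fit to the deep model's outputs rather than to the original labels. The sketch below illustrates that idea under loud assumptions: the "deep" teacher is a small fixed random network standing in for a trained model, the student regresses on the teacher's pre-softmax outputs with a squared-error loss, and every name, size, and hyperparameter (`make_teacher`, `train_student`, the learning rate) is hypothetical rather than taken from the paper.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_teacher(rng, n_in, n_hidden, n_out):
    """Hypothetical stand-in for a trained deep model: two tanh layers, fixed weights."""
    def mat(rows, cols):
        return [[rng.gauss(0, 1) / math.sqrt(cols) for _ in range(cols)]
                for _ in range(rows)]

    w1, w2, w3 = mat(n_hidden, n_in), mat(n_hidden, n_hidden), mat(n_out, n_hidden)

    def logits(x):
        h1 = [math.tanh(a) for a in matvec(w1, x)]
        h2 = [math.tanh(a) for a in matvec(w2, h1)]
        return matvec(w3, h2)  # pre-softmax outputs

    return logits


def train_student(teacher_logits, inputs, n_hidden, epochs, lr, rng):
    """Fit a one-hidden-layer net to the teacher's logits with per-example SGD.

    Returns the mean squared-error mimic loss observed during each epoch.
    """
    targets = [teacher_logits(x) for x in inputs]  # no ground-truth labels needed
    n_in, n_out = len(inputs[0]), len(targets[0])
    w = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    v = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, t in zip(inputs, targets):
            h = [math.tanh(a) for a in matvec(w, x)]
            y = matvec(v, h)
            err = [yi - ti for yi, ti in zip(y, t)]
            total += 0.5 * sum(e * e for e in err)
            # Backpropagate through the output layer before updating it.
            grad_h = [sum(err[k] * v[k][j] for k in range(n_out)) * (1 - h[j] ** 2)
                      for j in range(n_hidden)]
            for k in range(n_out):
                for j in range(n_hidden):
                    v[k][j] -= lr * err[k] * h[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    w[j][i] -= lr * grad_h[j] * x[i]
        history.append(total / len(inputs))
    return history


rng = random.Random(0)
teacher = make_teacher(rng, n_in=4, n_hidden=8, n_out=3)
inputs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]
history = train_student(teacher, inputs, n_hidden=8, epochs=30, lr=0.05, rng=rng)
print(f"mimic loss: first epoch {history[0]:.4f}, last epoch {history[-1]:.4f}")
```

Because the targets come from the teacher, this setup can in principle use unlabeled inputs as well; whether a shallow student matches a real deep model's accuracy depends on its width and on the training details, which this toy example does not address.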

representative citing papers

On the Generalization of Knowledge Distillation: An Information-Theoretic View

cs.IT · 2026-05-13 · unverdicted · novelty 7.0

Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.

Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix

cs.AI · 2021-12-21 · unverdicted · novelty 4.0

Proposes a modality relation distillation method that transfers teacher modality relationships via the modality-level Gram Matrix.

citing papers explorer

On the Generalization of Knowledge Distillation: An Information-Theoretic View · cs.IT · 2026-05-13 · unverdicted · ref 5

Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix · cs.AI · 2021-12-21 · unverdicted · ref 4
