Deep Ensembles on a Fixed Memory Budget: One Wide Network or Several Thinner Ones?

Dmitry Vetrov; Ekaterina Lobacheva; Nadezhda Chirkova

arxiv: 2005.07292 · v1 · pith:AWOVRVAJnew · submitted 2020-05-14 · 💻 cs.LG · stat.ML

keywords memory · ensemble · number · deep · network · parameters · split · thinner

A generally accepted view in modern deep learning is that increasing the number of parameters usually improves quality. The two easiest ways to increase the number of parameters are to enlarge the network, e.g. its width, or to train a deep ensemble; both approaches improve performance in practice. In this work, we consider a fixed memory budget and investigate which is more effective: training a single wide network, or performing a memory split, that is, training an ensemble of several thinner networks with the same total number of parameters. We find that, for large enough budgets, the number of networks in the ensemble corresponding to the optimal memory split is usually larger than one. Interestingly, this effect holds for the commonly used sizes of standard architectures. For example, one WideResNet-28-10 achieves significantly worse test accuracy on CIFAR-100 than an ensemble of sixteen thinner WideResNets: 80.6% and 82.52%, respectively. We call this effect the Memory Split Advantage and show that it holds for a variety of datasets and model architectures.
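To make the idea of a memory split concrete, the following is a minimal sketch of how one might pick the width of each ensemble member so that K networks together stay within a fixed parameter budget. It is not the authors' code: a plain fully connected network stands in for the WideResNets discussed in the abstract, and the names `mlp_params` and `split_width`, the layer sizes, and the default depth are all hypothetical choices made for illustration.

```python
def mlp_params(width, depth=3, in_dim=3072, out_dim=100):
    """Parameter count (weights + biases) of an MLP with `depth` hidden layers.

    Assumption: a fully connected stand-in model, not the paper's architecture.
    """
    dims = [in_dim] + [width] * depth + [out_dim]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))


def split_width(budget, k, **kwargs):
    """Largest hidden width such that k identical networks fit in `budget`."""
    per_net = budget // k  # parameters available to each ensemble member
    if mlp_params(1, **kwargs) > per_net:
        raise ValueError("budget too small for k networks of width 1")

    # Grow an upper bound until a single network no longer fits.
    lo, hi = 1, 2
    while mlp_params(hi, **kwargs) <= per_net:
        hi *= 2

    # Binary search; invariant: params(lo) <= per_net < params(hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mlp_params(mid, **kwargs) <= per_net:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Budget defined as the size of one wide network (hypothetical width 1024).
    budget = mlp_params(1024)
    for k in (1, 2, 4, 8, 16):
        w = split_width(budget, k)
        print(f"k={k:2d}  width={w:4d}  total params={k * mlp_params(w)}")
```

Each candidate split (K, width) found this way uses at most the same number of parameters as the single wide network, so the configurations can be trained and compared on test accuracy at equal memory cost, which is the comparison the abstract describes.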

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes
cs.LG 2026-05 unverdicted novelty 5.0

Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.