On improving deep learning generalization with adaptive sparse connectivity

Decebal Constantin Mocanu; Mykola Pechenizkiy; Shiwei Liu

arxiv: 1906.11626 · v1 · pith:OKMUOYQKnew · submitted 2019-06-27 · 💻 cs.NE · cs.LG

Shiwei Liu, Decebal Constantin Mocanu, Mykola Pechenizkiy

Pith reviewed 2026-05-25 13:57 UTC · model grok-4.3

keywords: sparse neural networks · generalization · adaptive sparse connectivity · neuron pruning · Sparse Evolutionary Training · multilayer perceptron · deep learning · parameter budget

The pith

Intrinsically sparse neural networks with adaptive sparse connectivity generalize better than fully-connected networks under a strict parameter budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper empirically demonstrates that neural networks kept sparse by design during training, through adaptive connection changes, achieve better generalization than dense networks on classification tasks with limited data. It introduces a training method that merges Sparse Evolutionary Training with neuron pruning to eliminate about half the hidden neurons while keeping the number of parameters linear in the neuron count. Experiments on multilayer perceptrons across 15 datasets show competitive accuracy and generalization. If correct, this indicates that maintaining a fixed parameter budget via sparsity can serve as an effective regularizer.

Core claim

Intrinsically sparse neural networks with adaptive sparse connectivity, which by design have a strict parameter budget during the training phase, have better generalization capabilities than their fully-connected counterparts. The proposed technique combines the Sparse Evolutionary Training (SET) procedure with neuron pruning to zero out around 50% of the hidden neurons during training, while keeping the number of parameters to optimize linear in the number of neurons. It yields competitive classification and generalization performance on 15 datasets.

What carries the argument

Adaptive sparse connectivity enforced by combining Sparse Evolutionary Training (SET) with neuron pruning, which maintains a strict parameter budget and prunes half the hidden neurons.
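
As a rough illustration of the adaptive connectivity step, the sketch below removes the connections whose weights are closest to zero and regrows the same number elsewhere, so the parameter budget stays fixed. It is a minimal sketch, not the authors' implementation: the dict-based layer representation, the function name `set_rewire`, the fraction `zeta`, the random placement of new connections, and the Gaussian initialization are all assumptions made for illustration.

```python
import random


def set_rewire(weights, n_in, n_out, zeta=0.3, rng=None):
    """One prune-and-regrow step on a sparse layer (illustrative only).

    weights: dict mapping (i, j) -> float, holding existing connections only.
    zeta: fraction of connections replaced per step (hypothetical default).
    Returns a new dict with the same number of connections as the input.
    """
    if rng is None:
        rng = random.Random(0)
    n_remove = int(zeta * len(weights))

    # Magnitude-based removal: drop the connections with weights closest to zero.
    by_magnitude = sorted(weights, key=lambda k: abs(weights[k]))
    kept = {k: weights[k] for k in by_magnitude[n_remove:]}

    # Regrow the same number of connections at random unoccupied positions.
    # Positions pruned in this step are eligible again (a simplification).
    empty = [(i, j) for i in range(n_in) for j in range(n_out) if (i, j) not in kept]
    for k in rng.sample(empty, n_remove):
        kept[k] = rng.gauss(0.0, 0.1)  # small random init, an assumption
    return kept
```

Because the number of regrown connections equals the number removed, the connection count is unchanged after every call. Enumerating all empty positions is quadratic in the layer size and is done here only for readability.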

If this is right

Sparse models show improved generalization compared to dense counterparts.
The method achieves competitive classification performance on 15 datasets.
Parameter count remains linear with respect to the number of neurons.
About 50% of hidden neurons can be pruned without harming performance.
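
The "linear in the number of neurons" point can be made concrete with a small count. The sketch below assumes the connection budget used by the original SET procedure, where a sparse layer holds roughly `epsilon * (n_in + n_out)` connections; that formula and the value of `epsilon` are assumptions brought in from the SET literature, not figures stated on this page.

```python
def dense_params(n_in, n_out):
    """Weights in a fully-connected layer: grows with the product of widths."""
    return n_in * n_out


def sparse_params(n_in, n_out, epsilon=20):
    """Expected weights in a SET-style sparse layer: grows with the sum of widths.

    epsilon is a hypothetical sparsity-control value; the count is capped at
    the dense size because a layer cannot hold more than n_in * n_out weights.
    """
    return min(epsilon * (n_in + n_out), n_in * n_out)


if __name__ == "__main__":
    for width in (1000, 2000, 4000):
        print(width, dense_params(width, width), sparse_params(width, width))
```

Doubling the layer width quadruples the dense count but only doubles the sparse count, which is the sense in which the budget stays linear.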

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might apply to other network types like CNNs for vision tasks.
Sparsity during training could complement or replace techniques like weight decay or dropout.
Fixed parameter budgets may help in resource-constrained training scenarios.
Further tests could vary the pruning rate to find optimal sparsity levels.

Load-bearing premise

Observed generalization improvements result from the adaptive sparse connectivity itself, not from specific dataset properties, the 50% pruning rate, or unmentioned differences in training dynamics.

What would settle it

Train the adaptive sparse networks and dense networks on identical datasets with matched parameter counts and hyperparameters, then compare test accuracy. If the sparse networks do not generalize better under these controls, the claim that adaptive connectivity itself drives the improvement would not hold.
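
Setting up a matched-parameter dense baseline mostly comes down to choosing a hidden width that fits the sparse model's budget. The sketch below does this for a dense MLP with a single hidden layer; the single-layer architecture, the inclusion of biases in the count, and the function names are assumptions for illustration, and the paper's own architectures may differ.

```python
def mlp_params(layer_sizes):
    """Weights plus biases of a dense MLP with the given layer widths."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))


def matched_hidden_width(n_in, n_out, budget):
    """Largest hidden width h such that a dense n_in-h-n_out MLP fits the budget.

    params = n_in*h + h + h*n_out + n_out, so
    h = (budget - n_out) // (n_in + n_out + 1).
    """
    h = (budget - n_out) // (n_in + n_out + 1)
    if h < 1:
        raise ValueError("budget too small for even one hidden neuron")
    return h
```

For example, with a hypothetical 100 inputs, 10 classes, and a budget of 5000 parameters, `matched_hidden_width(100, 10, 5000)` gives the widest dense network that does not exceed the sparse model's parameter count.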

Figures

Figures reproduced from arXiv: 1906.11626 by Decebal Constantin Mocanu, Mykola Pechenizkiy, Shiwei Liu.

**Figure 1.** Influence of hidden neurons removal (from the first hidden layer) on accuracy on the Lung-discrete dataset.

… of bipartite layers of neurons to evolve towards a scale-free topology, while learning to fit the data characteristics. After each training epoch, the connections having weights closest to zero are removed (magnitude based removal). After that, new connections (in the same amount as the removed one…

**Figure 2.** NPSET-MLP, SET-MLP and Dense-MLP generalization capabilities reflected by their learning curves.

Original abstract

Large neural networks are very successful in various tasks. However, with limited data, the generalization capabilities of deep neural networks are also very limited. In this paper, we empirically start showing that intrinsically sparse neural networks with adaptive sparse connectivity, which by design have a strict parameter budget during the training phase, have better generalization capabilities than their fully-connected counterparts. Besides this, we propose a new technique to train these sparse models by combining the Sparse Evolutionary Training (SET) procedure with neurons pruning. Operated on MultiLayer Perceptron (MLP) and tested on 15 datasets, our proposed technique zeros out around 50% of the hidden neurons during training, while having a linear number of parameters to optimize with respect to the number of neurons. The results show a competitive classification and generalization performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SET plus neuron pruning gives competitive sparse MLP results on 15 datasets, but the experiments do not isolate adaptive rewiring from simple parameter reduction.

This paper's core observation is that MLPs trained with SET combined with 50% neuron pruning reach competitive accuracy on 15 classification datasets while using a strict parameter budget. The new element is the addition of neuron pruning to the existing SET procedure, which keeps the number of parameters linear in the number of neurons and removes half the hidden units during training. That extension is straightforward and the multi-dataset evaluation is a reasonable check on whether the approach holds up across tasks. The work sits squarely in the sparse-training literature and offers a practical tweak rather than a new framework.

The main limitation is that the experiments compare the resulting sparse models only to fully-connected counterparts. There are no matched-parameter dense baselines, no fixed-sparsity controls without the evolutionary rewiring, and no ablation that disables adaptation while preserving the final sparsity level. Without those, it is not possible to attribute any generalization difference specifically to the adaptive connectivity mechanism instead of the reduced parameter count or other optimization details. The paper is empirical throughout, so the strength of the central claim depends on those controls being added.

Readers working on sparse neural network training will find the reported numbers and the simple implementation useful. The work is coherent enough on its own terms to merit peer review, though the authors should address the missing ablations before publication.

Referee Report

3 major / 1 minor

Summary. The paper claims that intrinsically sparse neural networks using adaptive sparse connectivity (via a combination of Sparse Evolutionary Training (SET) and neuron pruning) achieve better generalization than fully-connected counterparts. The proposed method is tested on MLPs across 15 datasets, where it zeros out approximately 50% of hidden neurons during training while maintaining a linear number of parameters with respect to the number of neurons, yielding competitive classification and generalization performance.

Significance. If substantiated with proper controls, the result would provide empirical support for the idea that adaptive sparsity mechanisms can enhance generalization under parameter budgets, offering a potential route to more efficient deep learning models in data-limited settings. The multi-dataset evaluation and the explicit combination of SET with pruning are concrete strengths that could be built upon.

major comments (3)

[Experiments] Experiments section: the central claim that adaptive sparse connectivity improves generalization over fully-connected networks is not isolated from the effect of the strict parameter budget, as the manuscript reports no matched-parameter dense baselines (i.e., dense MLPs with the same number of parameters as the final sparse models) and no fixed-sparsity (non-adaptive) controls that disable the evolutionary rewiring while preserving the final sparsity level.
[Experiments] Experiments section: no ablation is presented that disables the adaptive rewiring component of SET while keeping the 50% neuron pruning and parameter budget fixed, leaving open whether any observed gains are attributable to the adaptive mechanism itself rather than reduced parameter count or optimization dynamics.
[Results] Results section: the reported performance on the 15 datasets lacks details on statistical tests, variance across multiple runs, exact data splits, and baseline architectures, making it impossible to assess whether the generalization advantage is robust or reproducible.

minor comments (1)

[Abstract] The abstract states that the method 'zeros out around 50% of the hidden neurons' but does not clarify whether this is a fixed target or an emergent outcome of the combined SET+pruning procedure; a precise description of the pruning schedule would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important controls needed to strengthen the isolation of adaptive sparsity effects. We address each major comment below and will incorporate revisions to improve the manuscript.

Referee: [Experiments] Experiments section: the central claim that adaptive sparse connectivity improves generalization over fully-connected networks is not isolated from the effect of the strict parameter budget, as the manuscript reports no matched-parameter dense baselines (i.e., dense MLPs with the same number of parameters as the final sparse models) and no fixed-sparsity (non-adaptive) controls that disable the evolutionary rewiring while preserving the final sparsity level.

Authors: We agree that matched-parameter dense baselines and fixed-sparsity controls are necessary to better isolate the role of adaptive connectivity from the parameter budget itself. The original comparisons were to standard fully-connected MLPs (which use more parameters), and the sparse models achieve competitive results under a strict budget. In revision, we will add dense MLPs with parameter counts matched to the final sparse models and non-adaptive fixed-sparsity controls that preserve the same sparsity level without rewiring. revision: yes
Referee: [Experiments] Experiments section: no ablation is presented that disables the adaptive rewiring component of SET while keeping the 50% neuron pruning and parameter budget fixed, leaving open whether any observed gains are attributable to the adaptive mechanism itself rather than reduced parameter count or optimization dynamics.

Authors: This is a fair observation. While SET's adaptive rewiring is a core component of the proposed combination with neuron pruning, an explicit ablation would clarify its contribution. We will add this ablation in the revised manuscript by comparing the full adaptive SET+pruning approach against a variant that applies the same neuron pruning and parameter budget but disables evolutionary rewiring after initialization. revision: yes
Referee: [Results] Results section: the reported performance on the 15 datasets lacks details on statistical tests, variance across multiple runs, exact data splits, and baseline architectures, making it impossible to assess whether the generalization advantage is robust or reproducible.

Authors: We acknowledge that these experimental details were not sufficiently reported. The experiments involved multiple runs, but variance, statistical tests, exact splits, and architecture specifications were omitted from the results section. In the revision, we will include standard deviations across runs, specify data splits and preprocessing, detail all baseline architectures, and report statistical significance tests (such as paired t-tests) to demonstrate robustness. revision: yes
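
The paired comparison mentioned in the last response can be sketched as follows. This is only an illustration of a paired t statistic computed over per-run accuracies of two methods evaluated on the same seeds or splits; the function name and the pairing scheme are assumptions, and nothing here reflects how the authors actually ran or will run their tests.

```python
import math
import statistics


def paired_t(acc_a, acc_b):
    """Paired t statistic and degrees of freedom for two matched result lists.

    acc_a[i] and acc_b[i] must come from the same run index (same seed/split).
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("result lists must be paired one-to-one")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

The returned statistic would then be compared against a t distribution with the returned degrees of freedom to obtain a p-value.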

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or fitted predictions

The paper is an empirical investigation that proposes combining SET with neuron pruning, trains MLPs on 15 datasets, and reports classification performance. It contains no mathematical derivations, first-principles predictions, or quantities defined in terms of fitted parameters that are later presented as independent results. All claims rest on direct experimental comparisons rather than any self-referential reduction or self-citation chain. The central claim of improved generalization is therefore not forced by construction from the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper is an empirical study; it introduces no new mathematical axioms or postulated entities. The only adjustable quantity mentioned is the ~50% neuron pruning rate, which is presented as an observed outcome rather than an input parameter.

free parameters (1)

neuron pruning fraction
The abstract states the method zeros out around 50% of hidden neurons; this fraction is chosen during the procedure and directly affects the final model size.

axioms (1)

domain assumption: Standard back-propagation can be applied to dynamically sparse networks without modification
Implicit in the use of SET plus pruning on MLPs.
