Mean Spectral Normalization of Deep Neural Networks for Embedded Automation

Anand Krishnamoorthy Subramanian; Nak Young Chong

arxiv: 1907.04003 · v1 · pith:TSUUODELnew · submitted 2019-07-09 · 💻 cs.LG · stat.ML

Mean Spectral Normalization of Deep Neural Networks for Embedded Automation

Anand Krishnamoorthy Subramanian , Nak Young Chong This is my paper

Pith reviewed 2026-05-25 00:29 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords spectral normalizationmean spectral normalizationbatch normalizationweight reparameterizationdeep neural networksembedded automationgradient sparsitygenerative adversarial networks

0 comments

The pith

Mean Spectral Normalization fixes the mean-drift problem in spectral normalization and produces networks that run 16 percent faster than batch normalization with fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies spectral normalization in deep networks and finds that it raises gradient sparsity while holding gradient variance in check. A side effect called mean drift then appears and reduces overall performance relative to batch normalization. The authors introduce mean spectral normalization as a weight reparameterization that removes this drift. The resulting models train and run faster on both classification and generation tasks while using fewer trainable parameters. Experiments cover small, medium, and large convolutional networks plus generative adversarial networks to show the change works across embedded automation settings.

Core claim

Spectral normalization increases gradient sparsity and controls gradient variance yet suffers from mean-drift that limits its effectiveness. Mean spectral normalization corrects the drift by reparameterizing the weights, delivering networks that run approximately 16 percent faster in practice than batch-normalized counterparts and require fewer trainable parameters. The method is demonstrated on a 3-layer CNN, VGG7, DenseNet-BC, and on GAN-based image generation.

What carries the argument

Mean Spectral Normalization (MSN), a weight reparameterization that removes mean drift from spectral normalization.

If this is right

MSN applies directly to convolutional networks of varying depth without adding trainable parameters.
The same reparameterization improves both supervised image classification and unsupervised image generation with GANs.
Inference speed improves by roughly 16 percent compared with batch normalization, suiting resource-limited embedded devices.
Gradient sparsity and variance remain controlled while the mean-drift penalty disappears.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the mean-drift correction proves stable on larger or non-vision tasks, MSN could serve as a drop-in replacement for batch normalization in many embedded pipelines.
The same reparameterization idea might be applied to other normalization schemes that exhibit analogous drift.
Because MSN reduces parameter count while raising speed, it could lower memory and energy costs in real-time automation systems.
Further tests on recurrent or transformer architectures would show whether the drift phenomenon and its fix are architecture-specific.

Load-bearing premise

Mean drift is the main performance limiter of spectral normalization and mean spectral normalization removes it without introducing new drawbacks that offset the gains.

What would settle it

A head-to-head comparison on a large model such as ResNet trained on ImageNet in which MSN shows no speed gain or lower accuracy than batch normalization would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.04003 by Anand Krishnamoorthy Subramanian, Nak Young Chong.

**Figure 1.** Figure 1: Sparsity of gradients during training DenseNet-BC with LeakyReLU activation function. SN immediately induces a high percent of sparsity to [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Test Accuracy of various normalization methods during training DenseNet-BC, 3-layer CNN and VGG-7 models (with learning rate 0.001). [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Summary statistics of layer weights showing the internal covariate shift for various normalization methods while training DenseNet-BC. During [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Mean drift correction by MSN during training DenseNet-BC. Our [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Gradient Histograms of layers 44 and 75 (chosen randomly) of DenseNet-BC with BN, SN and MSN respectively. For the spectral normalized [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Mean layer-singular values during training for DenseNet-BC. [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

read the original abstract

Deep Neural Networks (DNNs) have begun to thrive in the field of automation systems, owing to the recent advancements in standardising various aspects such as architecture, optimization techniques, and regularization. In this paper, we take a step towards a better understanding of Spectral Normalization (SN) and its potential for standardizing regularization of a wider range of Deep Learning models, following an empirical approach. We conduct several experiments to study their training dynamics, in comparison with the ubiquitous Batch Normalization (BN) and show that SN increases the gradient sparsity and controls the gradient variance. Furthermore, we show that SN suffers from a phenomenon, we call the mean-drift effect, which mitigates its performance. We, then, propose a weight reparameterization called as the Mean Spectral Normalization (MSN) to resolve the mean drift, thereby significantly improving the network's performance. Our model performs ~16% faster as compared to BN in practice, and has fewer trainable parameters. We also show the performance of our MSN for small, medium, and large CNNs - 3-layer CNN, VGG7 and DenseNet-BC, respectively - and unsupervised image generation tasks using Generative Adversarial Networks (GANs) to evaluate its applicability for a broad range of embedded automation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MSN is a straightforward reparameterization fix for mean-drift in SN, with experiments on a handful of models but no ablations or broader tests to back the performance claims.

read the letter

The main point is that this paper defines a mean-drift effect in Spectral Normalization, then introduces Mean Spectral Normalization as a weight reparameterization to counter it. They report that the result runs about 16% faster than Batch Normalization in practice and uses fewer trainable parameters while keeping SN's gradient sparsity and variance control benefits. The experiments cover a 3-layer CNN, VGG7, DenseNet-BC, and GANs for image generation.

Referee Report

2 major / 2 minor

Summary. The paper empirically studies Spectral Normalization (SN) in DNNs, reports that it increases gradient sparsity and controls gradient variance relative to Batch Normalization (BN), identifies a 'mean-drift effect' that limits SN performance, and introduces Mean Spectral Normalization (MSN) as a weight reparameterization to correct it. MSN is claimed to yield better accuracy, ~16% faster training than BN, and fewer trainable parameters while preserving SN benefits; results are shown on a 3-layer CNN, VGG7, DenseNet-BC, and GANs for embedded automation tasks.

Significance. If the mean-drift diagnosis and its resolution by MSN are shown to be robust, the work would offer a practical regularization technique with measurable speed and parameter advantages for resource-constrained automation systems. The comparative training-dynamics analysis of SN versus BN is a useful empirical contribution, but the narrow experimental scope limits the strength of the generalization claim.

major comments (2)

[Abstract and Experiments] Abstract and experimental sections: the central claim that MSN resolves mean-drift and generalizes to 'a broad range of embedded automation tasks' rests on results from only four architectures (3-layer CNN, VGG7, DenseNet-BC, GANs); no ablation isolates mean-drift removal from other implementation changes, and no results are reported for larger models (e.g., ResNet-scale) or non-vision domains.
[Method and Results] Method and results sections: the definition of the mean-drift effect and the precise reparameterization that produces MSN are introduced without an explicit equation or derivation showing that the spectral-norm constraint is preserved; performance gains are asserted without error bars, statistical tests, or exclusion criteria for the reported speed-up and accuracy numbers.

minor comments (2)

[Abstract] Abstract: the '~16% faster' claim should specify the exact metric (wall-clock time per epoch, FLOPs, or inference latency) and the model on which it was measured.
[Method] Notation: the relationship between the new MSN scaling factor and the original spectral-norm Lipschitz constant should be stated explicitly to allow readers to verify that gradient-sparsity and variance-control properties are retained.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for the constructive feedback on our manuscript. We respond to the major comments point-by-point below.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and experimental sections: the central claim that MSN resolves mean-drift and generalizes to 'a broad range of embedded automation tasks' rests on results from only four architectures (3-layer CNN, VGG7, DenseNet-BC, GANs); no ablation isolates mean-drift removal from other implementation changes, and no results are reported for larger models (e.g., ResNet-scale) or non-vision domains.

Authors: We agree that the experiments cover only the four listed architectures, selected to span small-to-large CNNs and GANs in vision-based embedded automation. The generalization phrasing in the abstract is therefore stronger than the evidence directly supports. In revision we will moderate the abstract and introduction claims and add an explicit limitations paragraph. We will also expand the methods discussion to clarify how the mean-drift correction is isolated by construction in MSN. We cannot add new results on ResNet-scale models or non-vision domains without substantial additional experiments. revision: partial
Referee: [Method and Results] Method and results sections: the definition of the mean-drift effect and the precise reparameterization that produces MSN are introduced without an explicit equation or derivation showing that the spectral-norm constraint is preserved; performance gains are asserted without error bars, statistical tests, or exclusion criteria for the reported speed-up and accuracy numbers.

Authors: The manuscript defines mean-drift and the MSN reparameterization in Section 3, but we accept that an explicit equation and short derivation confirming preservation of the spectral-norm constraint would improve clarity. We will insert both in the revised methods section. For the reported performance numbers we will add error bars, note the number of runs, and specify the measurement protocol and any exclusion criteria used for the ~16 % speed-up figures. revision: yes

standing simulated objections not resolved

Results on larger models (ResNet-scale) or non-vision domains cannot be supplied without new experiments.

Circularity Check

0 steps flagged

No circularity: empirical method proposal with direct experimental validation

full rationale

The paper follows an explicitly empirical approach: it observes gradient sparsity/variance and mean-drift effects by comparing SN against BN on concrete networks, then introduces MSN as a reparameterization and reports measured speed/accuracy outcomes on the listed architectures. No derivation chain, uniqueness theorem, or fitted parameter is invoked; performance numbers are direct training results rather than quantities defined to equal the inputs by construction. The central claims therefore remain independent of the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review limited to abstract; no explicit free parameters, axioms, or invented entities beyond the MSN technique itself can be extracted.

invented entities (1)

Mean Spectral Normalization (MSN) no independent evidence
purpose: Reparameterization to resolve mean-drift effect in Spectral Normalization
New technique introduced to address the identified limitation of SN.

pith-pipeline@v0.9.0 · 5755 in / 1147 out tokens · 25304 ms · 2026-05-25T00:29:57.059714+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We, then, propose a weight reparameterization called as the Mean Spectral Normalization (MSN) to resolve the mean drift... hk = W/σ(W) gk−1 ; ˜hk = hk− E[hk] + m
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SN increases the gradient sparsity and controls the gradient variance... mean-drift effect

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 13 internal anchors

[1]

G. S. Chadha and A. Schwung. Comparison of deep neural network architectures for fault detection in tennessee eastman process. In Emerging Technologies and Factory Automation, 22nd IEEE Int’l Conf. on , pages 1–8. IEEE, 2017

work page 2017
[2]

Cheng, L

Y . Cheng, L. Zou, Z. Zhuang, Z. Sun, and W. Zhang. Deep rein- forcement learning combustion optimization system using synchronous neural episodic control. In 37th Chinese Control Conference , pages 8770–8775. IEEE, 2018

work page 2018
[3]

Choi and D.-J

H.-J. Choi and D.-J. Kang. Localization of welding defects using a weakly supervised neural network. In 18th Int’l Conf. on Control, Automation and Systems , pages 1461–1463. IEEE, 2018

work page 2018
[4]

Fadlullah, F

Z. Fadlullah, F. Tang, B. Mao, N. Kato, O. Akashi, T. Inoue, and K. Mizutani. State-of-the-art deep learning: Evolving machine intel- ligence toward tomorrow’s intelligent network trafﬁc control systems. IEEE Communications Surveys & Tutorials , 19(4):2432–2455, 2017

work page 2017
[5]

V . N. Nguyen, R. Jenssen, and D. Roverso. Automatic autonomous vision-based power line inspection: A review of current status and the potential role of deep learning. Int’l Journal of Electrical Power & Energy Systems, 99:107–120, 2018

work page 2018
[6]

Generative Adversarial Learning for Spectrum Sensing

K. Davaslioglu and Y . E. Sagduyu. Generative adversarial learning for spectrum sensing. arXiv preprint arXiv:1804.00709 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Generative Adversarial Network based Autoencoder: Application to fault detection problem for closed loop dynamical systems

I. Chakraborty, R. Chakraborty, and D. Vrabie. Generative adversarial network based autoencoder: Application to fault detection problem for closed loop dynamical systems. arXiv preprint arXiv:1804.05320 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Li and Y

Y . Li and Y . Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems , pages 8168–8177, 2018

work page 2018
[9]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Salimans and D

T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems , pages 901–909, 2016

work page 2016
[11]

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[12]

Miyato, T

T. Miyato, T. Kataoka, M. Koyama, and Y . Yoshida. Spectral normalization for generative adversarial networks. In Int’l Conf. on Learning Representations, 2018

work page 2018
[13]

Thomas G

Bingnan W. Thomas G. H. Shen Z., Shibo Z. Machine learning and deep learning algorithms for bearing fault diagnostics - a comprehen- sive review. arXiv preprint arXiv:1901.08247 , 2019

work page arXiv 1901
[14]

Understanding deep learning requires rethinking generalization

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understand- ing deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[15]

How Does Batch Normalization Help Optimization?

S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization?(no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Understanding Batch Normalization

J. Bjorck, C. Gomes, and B. Selman. Understanding batch normaliza- tion. arXiv preprint arXiv:1806.02375 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[17]

Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification

I. Gitman and B. Ginsburg. Comparison of batch normalization and weight normalization algorithms for the large-scale image classiﬁca- tion. arXiv preprint arXiv:1709.08145 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[18]

Instance Normalization: The Missing Ingredient for Fast Stylization

D. Ulyanov, A. Vedaldi, and V . Lempitsky. Instance normaliza- tion: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017

work page 2017
[20]

Spectral Norm Regularization for Improving the Generalizability of Deep Learning

Y . Yoshida and T. Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Neyshabur, S

B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Explor- ing generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017

work page 2017
[22]

H. Gouk, E. Frank, B. Pfahringer, and M. Cree. Regularisation of neural networks by enforcing lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018

work page arXiv 2018
[23]

Huang, Z

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017

work page 2017
[24]

Jiang and G

P. Jiang and G. Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems , pages 2526–2537, 2018

work page 2018
[25]

Gulrajani, F

I. Gulrajani, F. Ahmed, M. Arjovsky, V . Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems , pages 5767–5777, 2017

work page 2017
[26]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial nets. In Advances in neural information processing systems , pages 2672–2680, 2014

work page 2014
[27]

Wasserstein GAN

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[29]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems , pages 6626–6637, 2017

work page 2017
[30]

Salimans, I

T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems , pages 2234–2242, 2016

work page 2016
[31]

Generalizable Adversarial Training via Spectral Normalization

F. Farnia, J. M Zhang, and D. Tse. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[1] [1]

G. S. Chadha and A. Schwung. Comparison of deep neural network architectures for fault detection in tennessee eastman process. In Emerging Technologies and Factory Automation, 22nd IEEE Int’l Conf. on , pages 1–8. IEEE, 2017

work page 2017

[2] [2]

Cheng, L

Y . Cheng, L. Zou, Z. Zhuang, Z. Sun, and W. Zhang. Deep rein- forcement learning combustion optimization system using synchronous neural episodic control. In 37th Chinese Control Conference , pages 8770–8775. IEEE, 2018

work page 2018

[3] [3]

Choi and D.-J

H.-J. Choi and D.-J. Kang. Localization of welding defects using a weakly supervised neural network. In 18th Int’l Conf. on Control, Automation and Systems , pages 1461–1463. IEEE, 2018

work page 2018

[4] [4]

Fadlullah, F

Z. Fadlullah, F. Tang, B. Mao, N. Kato, O. Akashi, T. Inoue, and K. Mizutani. State-of-the-art deep learning: Evolving machine intel- ligence toward tomorrow’s intelligent network trafﬁc control systems. IEEE Communications Surveys & Tutorials , 19(4):2432–2455, 2017

work page 2017

[5] [5]

V . N. Nguyen, R. Jenssen, and D. Roverso. Automatic autonomous vision-based power line inspection: A review of current status and the potential role of deep learning. Int’l Journal of Electrical Power & Energy Systems, 99:107–120, 2018

work page 2018

[6] [6]

Generative Adversarial Learning for Spectrum Sensing

K. Davaslioglu and Y . E. Sagduyu. Generative adversarial learning for spectrum sensing. arXiv preprint arXiv:1804.00709 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

Generative Adversarial Network based Autoencoder: Application to fault detection problem for closed loop dynamical systems

I. Chakraborty, R. Chakraborty, and D. Vrabie. Generative adversarial network based autoencoder: Application to fault detection problem for closed loop dynamical systems. arXiv preprint arXiv:1804.05320 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Li and Y

Y . Li and Y . Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems , pages 8168–8177, 2018

work page 2018

[9] [9]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[10] [10]

Salimans and D

T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems , pages 901–909, 2016

work page 2016

[11] [11]

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[12] [12]

Miyato, T

T. Miyato, T. Kataoka, M. Koyama, and Y . Yoshida. Spectral normalization for generative adversarial networks. In Int’l Conf. on Learning Representations, 2018

work page 2018

[13] [13]

Thomas G

Bingnan W. Thomas G. H. Shen Z., Shibo Z. Machine learning and deep learning algorithms for bearing fault diagnostics - a comprehen- sive review. arXiv preprint arXiv:1901.08247 , 2019

work page arXiv 1901

[14] [14]

Understanding deep learning requires rethinking generalization

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understand- ing deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[15] [15]

How Does Batch Normalization Help Optimization?

S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization?(no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Understanding Batch Normalization

J. Bjorck, C. Gomes, and B. Selman. Understanding batch normaliza- tion. arXiv preprint arXiv:1806.02375 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[17] [17]

Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification

I. Gitman and B. Ginsburg. Comparison of batch normalization and weight normalization algorithms for the large-scale image classiﬁca- tion. arXiv preprint arXiv:1709.08145 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[18] [18]

Instance Normalization: The Missing Ingredient for Fast Stylization

D. Ulyanov, A. Vedaldi, and V . Lempitsky. Instance normaliza- tion: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[19] [19]

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017

work page 2017

[20] [20]

Spectral Norm Regularization for Improving the Generalizability of Deep Learning

Y . Yoshida and T. Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[21] [21]

Neyshabur, S

B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Explor- ing generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017

work page 2017

[22] [22]

H. Gouk, E. Frank, B. Pfahringer, and M. Cree. Regularisation of neural networks by enforcing lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018

work page arXiv 2018

[23] [23]

Huang, Z

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017

work page 2017

[24] [24]

Jiang and G

P. Jiang and G. Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems , pages 2526–2537, 2018

work page 2018

[25] [25]

Gulrajani, F

I. Gulrajani, F. Ahmed, M. Arjovsky, V . Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems , pages 5767–5777, 2017

work page 2017

[26] [26]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial nets. In Advances in neural information processing systems , pages 2672–2680, 2014

work page 2014

[27] [27]

Wasserstein GAN

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[29] [29]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems , pages 6626–6637, 2017

work page 2017

[30] [30]

Salimans, I

T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems , pages 2234–2242, 2016

work page 2016

[31] [31]

Generalizable Adversarial Training via Spectral Normalization

F. Farnia, J. M Zhang, and D. Tse. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018