pith. machine review for the scientific record.

arxiv: 1905.11946 · v5 · submitted 2019-05-28 · 💻 cs.LG · cs.CV · stat.ML

Recognition: 3 theorem links

· Lean Theorem

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 12:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML
keywords EfficientNet · compound scaling · model scaling · convolutional neural networks · ImageNet accuracy · neural architecture search · accuracy-efficiency tradeoff

The pith

Scaling depth, width, and resolution together with one compound coefficient produces more accurate and efficient convolutional networks than scaling any single dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that fixed-resource ConvNets improve more when depth, width, and resolution are increased in a coordinated way rather than one at a time. A single coefficient phi multiplies each dimension by fixed factors found through grid search on a small model. This compound scaling is applied first to existing architectures like MobileNet and ResNet, then to a new baseline network discovered by neural architecture search. The resulting EfficientNet family reaches 84.3 percent top-1 accuracy on ImageNet while using far fewer parameters and less inference time than earlier leaders. The same models also transfer effectively to CIFAR-100, Flowers, and other datasets.

Core claim

A compound scaling method that raises network depth by alpha to the power phi, width by beta to the power phi, and resolution by gamma to the power phi, with alpha, beta, and gamma fixed by grid search on a baseline model, yields a family of networks called EfficientNets. EfficientNet-B7 attains 84.3 percent top-1 accuracy on ImageNet while being 8.4 times smaller and 6.1 times faster at inference than the previous best ConvNet.
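The compound rule in this claim fits in a few lines. A minimal sketch, using the base ratios the paper reports (alpha=1.2, beta=1.1, gamma=1.15); the helper name `compound_scale` is illustrative, not from the paper's code:

```python
# Sketch of the compound scaling rule stated above. alpha, beta, gamma are
# the fixed base ratios found once by grid search on the baseline; phi is
# the single compound coefficient that scales all three dimensions.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution base ratios

def compound_scale(phi: float) -> tuple[float, float, float]:
    """Multipliers for (depth, width, resolution) at compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

Raising phi by one step enlarges all three dimensions together, which is exactly what single-dimension scaling does not do.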

What carries the argument

Compound scaling coefficient phi that simultaneously enlarges depth, width, and resolution according to fixed ratios determined once on a small network.

If this is right

  • EfficientNet-B7 sets a new accuracy record on ImageNet while being 8.4 times smaller and 6.1 times faster at inference than the prior best ConvNet.
  • The same compound scaling improves both MobileNets and ResNets without changing their architectures.
  • EfficientNets maintain state-of-the-art results on CIFAR-100, Flowers, and three additional transfer datasets while using an order of magnitude fewer parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation between finding a good baseline architecture and then scaling it uniformly may apply to architectures other than ConvNets.
  • Systematic scaling rules could reduce the need for repeated architecture search when moving to new hardware constraints.
  • Adaptive versions of the coefficient might further improve performance on tasks with different accuracy versus speed priorities.

Load-bearing premise

The scaling ratios found by grid search on a small baseline network stay near-optimal when the same ratios are used on much larger models and on different datasets.

What would settle it

A larger model trained with scaling ratios different from those found on the small baseline achieves higher ImageNet accuracy than the model produced by the compound coefficient.

read the original abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that carefully balancing network depth, width, and resolution via a single compound scaling coefficient phi yields better accuracy-efficiency tradeoffs than scaling any one dimension independently. The authors first identify fixed scaling ratios (alpha, beta, gamma) by grid search on a small baseline, demonstrate the method on MobileNet and ResNet families, then use neural architecture search to obtain EfficientNet-B0 and scale it uniformly to produce the B1-B7 family. EfficientNet-B7 reaches 84.3% top-1 ImageNet accuracy while being 8.4x smaller and 6.1x faster than prior best ConvNets, with strong transfer results on CIFAR-100, Flowers, and three additional datasets.

Significance. If the empirical results hold, the work is significant because it supplies a simple, reproducible scaling rule that improves upon conventional single-dimension scaling and has been widely adopted as a baseline. The large-scale ImageNet experiments, cross-family validation on MobileNet/ResNet, and transfer-task results provide direct support for the central claim, while the public code release aids reproducibility.

major comments (1)
  1. §3.2: the grid search that fixes alpha=1.2, beta=1.1, gamma=1.15 is performed only on the small baseline with phi in [1,5]; although the paper shows consistent gains when these ratios are applied to larger models, the load-bearing assumption that the ratios remain near-optimal at scale is supported only by the final held-out results rather than by intermediate-scale ablations that would quantify sensitivity to the chosen coefficients.
minor comments (2)
  1. The abstract states that EfficientNets achieve state-of-the-art accuracy on '3 other transfer learning datasets' but does not name them; explicitly listing all five datasets in the abstract would improve clarity.
  2. Eq. (2) and the surrounding text: the FLOPS constraint alpha * beta^2 * gamma^2 ≈ 2 is stated without a short derivation or reference to the underlying FLOPS scaling assumptions; adding one sentence would make the origin of the constant transparent.
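The derivation the second minor comment asks for is short: ConvNet FLOPS scale linearly with depth and quadratically with width and input resolution, so scaling by (alpha^phi, beta^phi, gamma^phi) multiplies FLOPS by (alpha * beta^2 * gamma^2)^phi, and pinning the base product near 2 makes each unit step of phi roughly double the budget. A quick numeric check with the paper's reported ratios (variable names here are illustrative):

```python
# Numeric check of the constraint alpha * beta^2 * gamma^2 ≈ 2 using the
# ratios reported in the paper. FLOPS grow linearly in depth and
# quadratically in width and resolution, so the per-step FLOPS multiplier
# is the base product below, and the multiplier at coefficient phi is its
# phi-th power.
alpha, beta, gamma = 1.2, 1.1, 1.15

base = alpha * beta**2 * gamma**2
print(f"alpha * beta^2 * gamma^2 = {base:.3f}")  # ~1.92, close to the target 2
for phi in (1, 2, 3):
    print(f"phi={phi}: FLOPS multiplier ~{base**phi:.2f} vs target {2**phi}")
```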

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. The summary accurately reflects the paper's contributions. We respond to the major comment below.

read point-by-point responses
  1. Referee: [—] §3.2: the grid search that fixes alpha=1.2, beta=1.1, gamma=1.15 is performed only on the small baseline with phi in [1,5]; although the paper shows consistent gains when these ratios are applied to larger models, the load-bearing assumption that the ratios remain near-optimal at scale is supported only by the final held-out results rather than by intermediate-scale ablations that would quantify sensitivity to the chosen coefficients.

    Authors: We appreciate the referee's careful reading of §3.2. The grid search for α=1.2, β=1.1, γ=1.15 was performed on the small baseline for φ ∈ [1,5] because the compound scaling rule is derived from the FLOPs equation, which predicts that the relative ratios among depth, width, and resolution should remain approximately constant across scales. We then directly test this assumption by applying the same fixed ratios to scale MobileNet and ResNet families as well as EfficientNet-B0 up to B7. The resulting models exhibit consistent accuracy-efficiency gains, culminating in EfficientNet-B7's state-of-the-art results. While additional ablations at every intermediate scale would be informative, they are computationally prohibitive; the broad validation across model families and the final held-out performance on large models constitute the most relevant evidence. In the revised manuscript we will add a short clarifying paragraph in §3.2 that explicitly states the theoretical motivation for constant ratios and summarizes the cross-scale validation already present in the experiments. revision: partial
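The search procedure the rebuttal describes can be sketched as a constrained grid search. The grids, tolerance, and `proxy_accuracy` below are stand-ins (the paper selects the ratios by actually training B0 variants at φ=1); only the constraint α·β²·γ² ≈ 2 and the reported optimum (1.2, 1.1, 1.15) come from the paper:

```python
# Illustrative sketch of the small-scale grid search referred to above.
# proxy_accuracy is a placeholder for "train the scaled baseline, measure
# validation accuracy"; it is peaked at the paper's reported optimum purely
# so the sketch is runnable.
import itertools

ALPHA_GRID = (1.0, 1.1, 1.2, 1.3, 1.4)
BETA_GRID = (1.0, 1.05, 1.1, 1.15, 1.2)
GAMMA_GRID = (1.0, 1.05, 1.1, 1.15, 1.2)

def flops_multiplier(a: float, b: float, g: float) -> float:
    # FLOPS grow linearly in depth, quadratically in width and resolution.
    return a * b**2 * g**2

def proxy_accuracy(a: float, b: float, g: float) -> float:
    # Stand-in oracle, maximized at the paper's reported (1.2, 1.1, 1.15).
    return -((a - 1.2) ** 2 + (b - 1.1) ** 2 + (g - 1.15) ** 2)

# Keep only candidates whose per-step FLOPS multiplier is close to 2.
candidates = [
    (a, b, g)
    for a, b, g in itertools.product(ALPHA_GRID, BETA_GRID, GAMMA_GRID)
    if abs(flops_multiplier(a, b, g) - 2.0) < 0.15
]
best = max(candidates, key=lambda abg: proxy_accuracy(*abg))
print("selected (alpha, beta, gamma):", best)  # (1.2, 1.1, 1.15)
```

The referee's sensitivity question amounts to asking how flat the true accuracy surface is around the selected point, which this one-shot search does not measure.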

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper determines scaling coefficients alpha, beta, and gamma via grid search on the small EfficientNet-B0 baseline to satisfy the compound scaling constraint. These fixed ratios are then applied uniformly via a single coefficient phi to produce larger models B1-B7. However, each scaled model is trained independently from scratch and evaluated on ImageNet plus transfer datasets, yielding accuracy and efficiency numbers that constitute external empirical measurements rather than outputs forced by the fitting procedure. No equation reduces to its own inputs by construction, no load-bearing self-citation is invoked for uniqueness, and the central performance claims rest on held-out training runs rather than tautological renaming or prediction of fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that balanced scaling is superior, with the compound coefficient serving as the main tunable parameter.

free parameters (1)
  • compound coefficient phi
    Chosen via grid search on the baseline network to match target resource budgets; different integer values produce the B1-B7 family.
axioms (1)
  • domain assumption There exists a fixed set of scaling ratios for depth, width, and resolution that remains near-optimal across model sizes.
    Invoked when the authors apply the same ratios found on B0 to all larger models.

pith-pipeline@v0.9.0 · 5541 in / 1214 out tokens · 36767 ms · 2026-05-16T12:15:52.068716+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • PhiForcing phi_equation echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient... d=α^φ, w=β^φ, r=γ^φ s.t. α·β²·γ²≈2

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification

    cs.CV 2026-04 unverdicted novelty 7.0

    The C-Score quantifies intra-class explanation consistency for CAM methods via confidence-weighted pairwise soft IoU and detects AUC-consistency dissociation as an early warning for model instability on chest X-ray cl...

  2. SMCNet: Supervised Surface Material Classification Using mmWave Radar IQ Signals and Complex-valued CNNs

    eess.SP 2026-04 unverdicted novelty 7.0

    SMCNet applies a complex-valued CNN to mmWave radar IQ data for high-accuracy surface material classification across multiple and unseen sensing distances.

  3. The DeepFake Detection Challenge (DFDC) Dataset

    cs.CV 2020-06 accept novelty 7.0

    The DFDC dataset is the largest public collection of face-swapped videos and supports detectors that generalize to in-the-wild deepfakes.

  4. LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...

  5. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  6. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  7. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  8. Sharpness-Aware Minimization for Efficiently Improving Generalization

    cs.LG 2020-10 conditional novelty 6.0

    SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.

  9. Exploring Clustering Capability of Inpainting Model Embeddings for Pattern-based Individual Identification

    cs.CV 2026-05 unverdicted novelty 5.0

    Inpainting auxiliary task improves clustering of embeddings for individual zebrafish identification based on skin patterns.

  10. DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training

    cs.LG 2026-05 unverdicted novelty 5.0

    DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts w...

  11. Equinox: Decentralized Scheduling for Hardware-Aware Orbital Intelligence

    cs.DC 2026-04 unverdicted novelty 5.0

    Equinox uses a barrier-function-derived marginal cost to enable value-based adaptive scheduling and neighbor offloading in energy-constrained satellite constellations, yielding 20-31% throughput gains and higher batte...

  12. Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

    cs.CV 2026-04 unverdicted novelty 5.0

    Models predicting human authenticity judgments produce inconsistent attribution maps across architectures, showing that explanations are non-identifiable.

  13. Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition

    cs.CV 2026-01 unverdicted novelty 5.0

    FMSD improves cross-dataset generalization in deepfake detection by using gradient-based layer masking to select forgery-sensitive weights and SVD to split them into preserved semantic and multiple learnable artifact ...

  14. Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection

    cs.CV 2026-05 unverdicted novelty 4.0

    A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher mod...

  15. DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation

    cs.CV 2026-04 unverdicted novelty 4.0

    DYMAPIA builds dynamic anomaly masks from Fourier spectra, texture, edges, and optical flow to guide a lightweight DistXCNet classifier, reporting over 99% accuracy and F1 on FF++, Celeb-DF, and VDFD.

  16. Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

    cs.CV 2026-04 unverdicted novelty 3.0

    RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.

  17. Real-Time Cellist Postural Evaluation With On-Device Computer Vision

    cs.HC 2026-04 unverdicted novelty 3.0

    Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.

  18. Robust Deepfake Detection, NTIRE 2026 Challenge: Report

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge finds that large foundation models combined with ensembles and degradation-aware training produce the most robust deepfake detectors.

  19. Scaling Laws for Neural Language Models

    cs.LG 2020-01

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 19 Pith papers · 7 internal anchors

  1. [1]

    Birdsnap: Large-scale fine-grained visual categorization of birds

    Berg, T., Liu, J., Woo Lee, S., Alexander, M. L., Jacobs, D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. CVPR, pp. 2011-2018, 2014

  2. [2]

    Food-101--mining discriminative components with random forests

    Bossard, L., Guillaumin, M., and Van Gool, L. Food-101--mining discriminative components with random forests. ECCV, pp. 446-461, 2014

  3. [3]

    Proxylessnas: Direct neural architecture search on target task and hardware

    Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. ICLR, 2019

  4. [4]

    Xception: Deep learning with depthwise separable convolutions

    Chollet, F. Xception: Deep learning with depthwise separable convolutions. CVPR, 2017

  5. [5]

    Autoaugment: Learning augmentation policies from data

    Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation policies from data. CVPR, 2019

  6. [6]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107: 3-11, 2018

  7. [7]

    Squeezenext: Hardware-aware neural network design

    Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., and Keutzer, K. Squeezenext: Hardware-aware neural network design. ECV Workshop at CVPR'18, 2018

  8. [8]

    Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR, 2016

  9. [9]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, pp. 770-778, 2016

  10. [10]

    Mask r-cnn

    He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask r-cnn. ICCV, pp. 2980-2988, 2017

  11. [11]

    Amc: Automl for model compression and acceleration on mobile devices

    He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. Amc: Automl for model compression and acceleration on mobile devices. ECCV, 2018

  12. [12]

    Gaussian Error Linear Units (GELUs)

    Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  13. [13]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017

  14. [14]

    Squeeze-and-excitation networks

    Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. CVPR, 2018

  15. [15]

    Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. ECCV, pp. 646-661, 2016

  16. [16]

    Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. CVPR, 2017

  17. [17]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1808.07233, 2018

  18. [18]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and < 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016

  19. [19]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, pp. 448-456, 2015

  20. [20]

    Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? CVPR, 2019

  21. [21]

    Collecting a large-scale dataset of fine-grained cars

    Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. Second Workshop on Fine-Grained Visual Categorization, 2013

  22. [22]

    Learning multiple layers of features from tiny images

    Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009

  23. [23]

    Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. NIPS, pp. 1097-1105, 2012

  24. [24]

    Resnet with one-neuron hidden layers is a universal approximator

    Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. NeurIPS, pp. 6172-6181, 2018

  25. [25]

    Feature pyramid networks for object detection

    Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. CVPR, 2017

  26. [26]

    Progressive neural architecture search

    Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. ECCV, 2018

  27. [27]

    The expressive power of neural networks: A view from the width

    Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. NeurIPS, 2018

  28. [28]

    Shufflenet v2: Practical guidelines for efficient cnn architecture design

    Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. ECCV, 2018

  29. [29]

    Exploring the Limits of Weakly Supervised Pretraining

    Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018

  30. [30]

    Fine-Grained Visual Classification of Aircraft

    Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  31. [31]

    Domain Adaptive Transfer Learning with Specialist Models

    Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V., and Pang, R. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018

  32. [32]

    Automated flower classification over a large number of classes

    Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. ICVGIP, pp. 722-729, 2008

  33. [33]

    Cats and dogs

    Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. CVPR, pp. 3498-3505, 2012

  34. [34]

    On the expressive power of deep neural networks

    Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. ICML, 2017

  35. [35]

    Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2018

  36. [36]

    Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. AAAI, 2019

  37. [37]

    Imagenet large scale visual recognition challenge

    Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252, 2015

  38. [38]

    Mobilenetv2: Inverted residuals and linear bottlenecks

    Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. CVPR, 2018

  39. [39]

    On the expressive power of overlapping architectures of deep learning

    Sharir, O. and Shashua, A. On the expressive power of overlapping architectures of deep learning. ICLR, 2018

  40. [40]

    Dropout: a simple way to prevent neural networks from overfitting

    Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929-1958, 2014

  41. [41]

    Going deeper with convolutions

    Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CVPR, pp. 1-9, 2015

  42. [42]

    Rethinking the inception architecture for computer vision

    Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CVPR, pp. 2818-2826, 2016

  43. [43]

    Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. AAAI, 2017

  44. [44]

    Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. CVPR, 2019

  45. [45]

    Aggregated residual transformations for deep neural networks

    Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. CVPR, pp. 5987-5995, 2017

  46. [46]

    Netadapt: Platform-aware neural network adaptation for mobile applications

    Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., and Adam, H. Netadapt: Platform-aware neural network adaptation for mobile applications. ECCV, 2018

  47. [47]

    Wide residual networks

    Zagoruyko, S. and Komodakis, N. Wide residual networks. BMVC, 2016

  48. [48]

    Polynet: A pursuit of structural diversity in very deep networks

    Zhang, X., Li, Z., Loy, C. C., and Lin, D. Polynet: A pursuit of structural diversity in very deep networks. CVPR, pp. 3900-3908, 2017

  49. [49]

    Shufflenet: An extremely efficient convolutional neural network for mobile devices

    Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CVPR, 2018

  50. [50]

    Learning deep features for discriminative localization

    Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. CVPR, pp.\ 2921--2929, 2016

  51. [51]

    Neural architecture search with reinforcement learning

    Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017

  52. [52]

    Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018