pith. machine review for the scientific record.

arxiv: 1710.05941 · v2 · submitted 2017-10-16 · 💻 cs.NE · cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

Searching for Activation Functions

Pith reviewed 2026-05-12 02:43 UTC · model grok-4.3

classification 💻 cs.NE · cs.CV · cs.LG
keywords activation functions · Swish · ReLU · deep neural networks · automatic search · reinforcement learning · ImageNet · neural architecture search

The pith

Automatic search discovers Swish, an activation function that tends to work better than ReLU on deeper networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace hand-designed activation functions with ones found through systematic search. It combines exhaustive enumeration with reinforcement learning to explore possible functions and identifies Swish, given by f(x) = x · sigmoid(βx), as the strongest performer. A sympathetic reader would care because activation choice affects how networks train and how well they solve tasks, so a better default could improve many existing models at little cost. Tests show Swish lifts accuracy on ImageNet and other datasets when swapped directly into established architectures. The result suggests that search can automate a choice previously left to manual trial and error.

Core claim

Using a combination of exhaustive and reinforcement learning-based search, the authors discover multiple activation functions and identify the best one, f(x) = x · sigmoid(βx), which they name Swish. Swish tends to work better than ReLU on deeper models across challenging datasets. Simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it straightforward for practitioners to adopt.

What carries the argument

The Swish activation function f(x) = x · sigmoid(βx), found by combining exhaustive search over simple expressions with reinforcement-learning-guided search over more complex candidates.
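
The definition above is short enough to write out directly. The following is a minimal sketch in plain Python, not the authors' implementation; the function names `sigmoid`, `swish`, and `relu` are chosen here for illustration, and the branch inside `sigmoid` is only a common trick to avoid floating-point overflow.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, branched so that exp() never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish as defined in the text: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified Linear Unit, the baseline Swish is compared against."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print both activations side by side on a few sample inputs.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

For large positive inputs the two functions nearly coincide, while for negative inputs Swish returns small negative values instead of exactly zero.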

If this is right

  • Deeper networks show larger relative gains from Swish than shallower ones.
  • Swish can be dropped into any existing architecture in place of ReLU without other changes.
  • The same search procedure yields several other functions that also outperform ReLU in the reported experiments.
  • Practitioners can adopt Swish immediately because it requires no new hyperparameters beyond the single scalar β.
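
The "drop-in replacement" idea in the list above can be sketched as a layer whose activation is just a function argument. This is an illustrative toy in plain Python, not code from the paper: the `dense` helper, the weights, and the input vector are all hypothetical, and a real model would use a deep learning framework.

```python
import math
from typing import Callable, List, Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def swish(x: float, beta: float = 1.0) -> float:
    # x * sigmoid(beta * x), with the sigmoid branched to avoid overflow.
    z = beta * x
    s = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return x * s


def dense(inputs: Sequence[float],
          weights: Sequence[Sequence[float]],
          biases: Sequence[float],
          activation: Callable[[float], float]) -> List[float]:
    """One fully connected layer: the activation is applied element-wise to W·x + b."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical weights and input, chosen only to make the arithmetic easy to follow.
weights = [[0.5, 0.25], [-1.0, 0.5]]
biases = [0.0, 0.0]
x = [1.0, -2.0]

relu_out = dense(x, weights, biases, relu)    # activation = ReLU
swish_out = dense(x, weights, biases, swish)  # same layer, only the activation swapped
```

Nothing else in the layer changes between the two calls, which is the sense in which the swap is "drop-in". Whether the surrounding training setup should also stay unchanged is the question raised under the load-bearing premise.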

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The search framework could be reused to discover task-specific activation functions rather than a single universal one.
  • Because Swish is smooth and non-monotonic (it dips slightly below zero for negative inputs before flattening back toward zero), it may interact differently with gradient-based optimizers than ReLU does.
  • Similar automated search might be applied to other low-level choices such as normalization layers or loss functions.
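
The smoothness and non-monotonicity mentioned in the list above can be read off the derivative. The derivation below is a standard calculus exercise, given here as a worked sketch rather than as a result quoted from the paper; σ denotes the logistic sigmoid and β > 0 is assumed for the location of the minimum.

```latex
\begin{align*}
f(x)  &= x\,\sigma(\beta x), \qquad \sigma(z) = \frac{1}{1+e^{-z}}, \\
f'(x) &= \sigma(\beta x) + \beta x\,\sigma(\beta x)\bigl(1-\sigma(\beta x)\bigr) \\
      &= \beta f(x) + \sigma(\beta x)\bigl(1-\beta f(x)\bigr). \\[4pt]
f'(x^\ast) = 0 \;&\Longleftrightarrow\; t^\ast + 1 + e^{t^\ast} = 0 \quad\text{with } t^\ast = \beta x^\ast,
  \qquad t^\ast \approx -1.278, \\
f(x^\ast) &= -\frac{e^{t^\ast}}{\beta} \approx -\frac{0.278}{\beta}.
\end{align*}
```

So for β = 1 the function reaches its minimum of about −0.278 near x ≈ −1.278 and is decreasing to the left of that point, which is where the non-monotonicity comes from. Unlike ReLU, the derivative is defined and continuous everywhere, including at x = 0, where f'(0) = 1/2.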

Load-bearing premise

The performance gains come from intrinsic properties of the discovered functions rather than from interactions with the specific model architectures, training schedules, or hyperparameter settings used in the tests.

What would settle it

A controlled study that swaps Swish into a broad collection of models while holding all other training details fixed and finds no consistent accuracy improvement would show the advantage is not general.
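
The bookkeeping for such a controlled study can be sketched as a paired comparison: one accuracy delta per model, with everything except the activation held fixed. The code below is an assumption-laden illustration, not a protocol from the paper; `sign_test_p` and `summarize` are hypothetical helper names, and the only numbers used are the two ImageNet deltas quoted in the text.

```python
from math import comb
from statistics import mean
from typing import Dict


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided sign test p-value under H0: a win and a loss are equally likely."""
    n = wins + losses  # ties are dropped, as is usual for the sign test
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def summarize(deltas: Dict[str, float]) -> Dict[str, float]:
    """deltas maps model name -> (Swish top-1 accuracy minus ReLU top-1 accuracy)."""
    values = list(deltas.values())
    wins = sum(d > 0 for d in values)
    losses = sum(d < 0 for d in values)
    return {
        "models": len(values),
        "wins": wins,
        "losses": losses,
        "mean_delta": mean(values),
        "sign_test_p": sign_test_p(wins, losses),
    }


# The two ImageNet deltas quoted in the text; a real study would list many more models.
reported = {"Mobile NASNet-A": 0.9, "Inception-ResNet-v2": 0.6}
summary = summarize(reported)
```

With only these two architectures the sign test gives p = 0.5, so the test cannot separate a real effect from chance on so few models. That is one way to see why a broad collection of models is needed to settle the question.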

Original abstract

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, f(x) = x · sigmoid(βx), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes using automatic search techniques, both exhaustive and reinforcement learning-based, to discover new activation functions for deep neural networks. The best discovered function, named Swish and defined as f(x) = x · sigmoid(βx), is reported to outperform the standard ReLU activation on deeper models across challenging datasets; for example, simply replacing ReLU improves ImageNet top-1 accuracy by 0.9% for Mobile NASNet-A and by 0.6% for Inception-ResNet-v2.

Significance. If the empirical results are robust to hyperparameter choices, this work is significant in providing a simple yet effective alternative to ReLU that practitioners can easily adopt. The automated search approach offers a systematic alternative to hand-designed activations and demonstrates concrete gains on standard benchmarks like ImageNet, which is a strength of the manuscript.

major comments (2)
  1. [§4] The experiments report accuracy improvements by replacing ReLU with Swish while retaining identical training schedules, optimizer, learning-rate schedule, batch size, and initialization chosen for ReLU. Without per-activation hyperparameter re-optimization or ablation on schedule sensitivity, it remains unclear whether the reported gains (0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 on ImageNet) are caused by the intrinsic properties of Swish or by interactions with ReLU-tuned protocols.
  2. [§3] The search space definition, number of trials, and statistical controls for both the exhaustive and RL-based searches are insufficiently detailed. This affects assessment of whether Swish was reliably identified as superior rather than selected post-hoc from a large set of candidates.
minor comments (3)
  1. The value of β used in the reported Swish experiments should be explicitly stated (fixed or learned) along with sensitivity analysis.
  2. Additional citations to prior parametric activation functions (e.g., PReLU) would strengthen the related-work discussion.
  3. Figures comparing activation functions would benefit from including their derivatives to illustrate effects on gradient flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] The experiments report accuracy improvements by replacing ReLU with Swish while retaining identical training schedules, optimizer, learning-rate schedule, batch size, and initialization chosen for ReLU. Without per-activation hyperparameter re-optimization or ablation on schedule sensitivity, it remains unclear whether the reported gains (0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 on ImageNet) are caused by the intrinsic properties of Swish or by interactions with ReLU-tuned protocols.

    Authors: We agree that the hyperparameters and training schedules were those originally tuned for ReLU, and no per-activation re-optimization or schedule ablation was performed. This design choice was made to evaluate Swish as a drop-in replacement that requires no additional tuning effort from practitioners. The consistent gains across two distinct architectures (Mobile NASNet-A and Inception-ResNet-v2) and multiple datasets provide supporting evidence that the improvements are not solely due to protocol interactions. Nevertheless, we acknowledge the limitation noted by the referee. In the revised manuscript we will add an explicit discussion of this point in Section 4 and include a limited ablation study on learning-rate sensitivity for Swish versus ReLU using a smaller proxy task.
    Revision: partial

  2. Referee: [§3] The search space definition, number of trials, and statistical controls for both the exhaustive and RL-based searches are insufficiently detailed. This affects assessment of whether Swish was reliably identified as superior rather than selected post-hoc from a large set of candidates.

    Authors: We thank the referee for highlighting the need for greater transparency in the search methodology. The original Section 3 described the overall approach at a high level but omitted precise specifications of the search space, exact trial counts, and statistical safeguards. In the revision we will expand Section 3 to enumerate the full set of unary and binary operations, the constants considered, the total number of functions evaluated in the exhaustive search, the RL training details (agent architecture, number of episodes, and reward formulation), and any repeated runs or variance metrics used to rank candidates. These additions will allow readers to evaluate the reliability of Swish's selection.
    Revision: yes
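
To make "enumerating unary and binary operations" concrete, here is a toy exhaustive enumeration. It assumes that candidates have the form b(u1(x), u2(x)), with two unary functions combined by one binary function. The operation sets below are illustrative placeholders, not the paper's actual search space, and the expensive step of scoring each candidate by training a network is left out entirely.

```python
import itertools
import math
from typing import Callable, Dict


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative operation sets (assumptions for this sketch only).
UNARY: Dict[str, Callable[[float], float]] = {
    "x": lambda x: x,
    "-x": lambda x: -x,
    "sigmoid(x)": sigmoid,
    "tanh(x)": math.tanh,
    "max(x, 0)": lambda x: max(x, 0.0),
}
BINARY: Dict[str, Callable[[float, float], float]] = {
    "({}) + ({})": lambda a, b: a + b,
    "({}) * ({})": lambda a, b: a * b,
    "max({}, {})": lambda a, b: max(a, b),
}


def enumerate_candidates() -> Dict[str, Callable[[float], float]]:
    """Build every b(u1(x), u2(x)) from the sets above, keyed by a readable name."""
    out: Dict[str, Callable[[float], float]] = {}
    for (n1, u1), (n2, u2), (nb, b) in itertools.product(
        UNARY.items(), UNARY.items(), BINARY.items()
    ):
        # Default arguments freeze the current u1, u2, b inside each lambda.
        out[nb.format(n1, n2)] = lambda x, u1=u1, u2=u2, b=b: b(u1(x), u2(x))
    return out


candidates = enumerate_candidates()
swish_like = candidates["(x) * (sigmoid(x))"]  # Swish with beta = 1
```

Even this tiny space yields 5 × 5 × 3 = 75 candidates, and x · sigmoid(x) is one of them. The referee's concern is about what happens next: how each candidate is scored, how many times, and with what variance.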

Circularity Check

0 steps flagged

No circularity: empirical search discovery followed by independent benchmark evaluation

Full rationale

The paper's chain consists of (1) applying exhaustive and RL-based search over activation function spaces to identify candidates, (2) selecting the best performer f(x) = x · sigmoid(βx), and (3) measuring its accuracy when substituted into fixed architectures on held-out datasets such as ImageNet. None of these steps reduces a reported performance delta to a quantity defined by the same fitted parameters or search objective used to generate the candidate; the gains are direct empirical measurements on standard test sets under the paper's stated protocol. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and no equation equates a prediction to its own input by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that functions found by the described search procedure will generalize and that the reported accuracy deltas are attributable to the activation choice.

free parameters (1)
  • β
    Scaling parameter inside the discovered Swish form; its value is part of the function definition and may have been chosen or optimized during search.
axioms (1)
  • Domain assumption: the space of functions considered during search contains useful activation functions that transfer to held-out models and tasks.
    Invoked when the authors treat the search output as a general recommendation rather than a one-off result.
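
The role of the free parameter β listed above can be checked numerically. At β = 0 the sigmoid factor is exactly 1/2, so Swish reduces to the linear function x/2, and as β grows Swish approaches ReLU. The sketch below illustrates both limits in plain Python; the grid and the helper `max_gap_to_relu` are assumptions made for this example.

```python
import math
from typing import Sequence


def swish(x: float, beta: float) -> float:
    # x * sigmoid(beta * x), with the sigmoid branched to avoid overflow.
    z = beta * x
    s = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return x * s


def relu(x: float) -> float:
    return max(0.0, x)


def max_gap_to_relu(beta: float, grid: Sequence[float]) -> float:
    """Largest absolute difference between Swish and ReLU over the grid points."""
    return max(abs(swish(x, beta) - relu(x)) for x in grid)


grid = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0

# beta = 0: every output equals x / 2.
half_linear = [swish(x, 0.0) for x in grid]

# Increasing beta: the gap to ReLU on this grid shrinks.
gaps = {beta: max_gap_to_relu(beta, grid) for beta in (1.0, 10.0, 100.0)}
```

Swish therefore interpolates between a scaled linear function and ReLU as β varies, which is one reason the referee asks whether β was fixed or learned in the reported experiments.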

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    Our experiments show that the best discovered activation function, f(x) = x · sigmoid(βx), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets.

  • Foundation.DAlembert.Inevitability bilinear_family_forced
    Relation between the paper passage and the cited Recognition theorem: unclear

    Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

    cs.LG 2026-05 unverdicted novelty 8.0

    Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.

  2. KAN: Kolmogorov-Arnold Networks

    cs.LG 2024-04 conditional novelty 8.0

    KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

  3. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  4. Neural Statistical Functions

    cs.LG 2026-05 unverdicted novelty 7.0

    Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

  5. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  6. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...

  7. Selectivity and Shape in the Design of Forward-Forward Goodness Functions

    cs.LG 2026-03 unverdicted novelty 7.0

    Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.

  8. SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning

    cs.LG 2026-03 unverdicted novelty 7.0

    SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.

  9. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  10. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  11. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  12. On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

    math.OC 2026-05 unverdicted novelty 6.0

    Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

  13. Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse MoE in FFN blocks redistributes computation to attention in small Transformers primarily due to architectural capacity reduction and partitioning, not learned router specialization.

  14. MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    MoMo uses Feature-Wise Linear Modulation and low-rank neural modulation to condition contrastive planning representations on user preferences while preserving inference efficiency and probability density ratios.

  15. MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.

  16. What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies

    cs.LG 2026-05 unverdicted novelty 6.0

    MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.

  17. On the Blessing of Pre-training in Weak-to-Strong Generalization

    cs.LG 2026-05 unverdicted novelty 6.0

    Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.

  18. Competing nonlinearities, criticality, and order-to-chaos transition in deep networks

    cond-mat.dis-nn 2026-05 unverdicted novelty 6.0

    A statistical mixture of Tanh and Swish activations with critical mixing fraction p_c induces a continuous phase transition to scale-invariant signal propagation in deep networks while preserving smoothness.

  19. Neural-network reconstruction of THz transmission spectra using electrically tunable AlGaN/GaN plasmonic-crystal analyzer

    physics.optics 2026-05 unverdicted novelty 6.0

    A feedforward neural network trained on synthetic data inverts voltage-dependent intensities from an electrically tunable AlGaN/GaN plasmonic analyzer to reconstruct THz spectra, achieving lower error than Tikhonov re...

  20. Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics

    cs.LG 2026-05 unverdicted novelty 6.0

    EDL learns a transferable classification loss from unlimited synthetic data via evolutionary optimization and a ranking-consistency objective, serving as a competitive drop-in replacement for cross-entropy on CIFAR-10...

  21. AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    cs.CL 2026-05 unverdicted novelty 6.0

    AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

  22. Four-dimensional QCD equation of state from a quasi-parton model with physics-informed neural networks

    nucl-th 2026-04 unverdicted novelty 6.0

    A PINN-trained quasi-parton model reproduces lattice cumulants at vanishing chemical potentials and supplies a consistent four-dimensional QCD equation of state at finite densities.

  23. Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

    cs.LG 2026-04 unverdicted novelty 6.0

    GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.

  24. A Complex-Valued Continuous-Variable Quantum Approximation Optimization Algorithm (CCV-QAOA)

    quant-ph 2026-04 unverdicted novelty 6.0

    CCV-QAOA is a new complex-valued continuous-variable variant of QAOA that solves real and complex multivariate optimization problems via a variational framework.

  25. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...

  26. OTProf: estimating high-resolution profiles of optical turbulence ($C_n^2$) from reanalysis using deep learning

    physics.ao-ph 2026-04 conditional novelty 6.0

    Deep learning model OTProf generates high-resolution C_n² profiles from ERA5 reanalysis data and outperforms the Hufnagel-Valley model for vertical structure and integrated parameters like Fried parameter r_0 in the N...

  27. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  28. Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices

    math.NA 2026-05 unverdicted novelty 5.0

    A neural network predicts sensitive pseudospectra regions from matrix features to accelerate computation on structured non-normal banded matrices while preserving accuracy in identifying those regions.

  29. Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery

    cs.CV 2026-05 unverdicted novelty 5.0

    LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.

  30. Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions

    cs.AI 2026-05 unverdicted novelty 5.0

    BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.

  31. AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    AGoQ cuts LLM training memory by up to 52% and speeds it up by 1.34x using tailored 4-bit activations and 8-bit gradients with special communication, matching baseline accuracy on LLaMA models.

  32. GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories

    cs.RO 2026-04 unverdicted novelty 5.0

    GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks ...

  33. Physics-informed neural networks for form-finding of unilateral membrane structures

    cs.CE 2026-04 unverdicted novelty 5.0

    PINNs with hard and soft boundary enforcement solve membrane form-finding PDEs to accuracy comparable with FEM, with hard-BC yielding smaller boundary errors.

  34. ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications

    cs.LG 2026-04 unverdicted novelty 5.0

    ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.

  35. YOLOv4: Optimal Speed and Accuracy of Object Detection

    cs.CV 2020-04 unverdicted novelty 5.0

    YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.

  36. Agentic Risk-Aware Set-Based Engineering Design

    cs.AI 2026-04 unverdicted novelty 4.0

    Multi-agent LLM system applies set-based design and Conditional Value-at-Risk to explore and risk-filter airfoil designs with human manager coordination.

  37. GLU Variants Improve Transformer

    cs.LG 2020-02 unverdicted novelty 4.0

    Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

  38. Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance

    econ.GN 2026-05 unverdicted novelty 3.0

    The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.

  39. Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification

    cs.CV 2026-05 unverdicted novelty 3.0

    A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.

  40. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

  41. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
