arxiv: 1710.05941 · v2 · submitted 2017-10-16 · 💻 cs.NE · cs.CV · cs.LG

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, Quoc V. Le

Pith reviewed 2026-05-12 02:43 UTC · model grok-4.3

classification: 💻 cs.NE · cs.CV · cs.LG

keywords: activation functions · Swish · ReLU · deep neural networks · automatic search · reinforcement learning · ImageNet · neural architecture search

The pith

Automatic search discovers the Swish activation function, which tends to outperform ReLU on deeper networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace hand-designed activation functions with ones found through systematic search. It combines exhaustive enumeration with reinforcement learning to explore candidate functions and identifies Swish, f(x) = x · sigmoid(βx), as the strongest performer. A sympathetic reader would care because activation choice affects how networks train and how well they solve tasks, so a better default could improve many existing models at little cost. In the reported tests, swapping Swish directly into established architectures lifts accuracy on ImageNet and other datasets. The result suggests that search can automate a choice previously left to manual trial and error.

Core claim

Using a combination of exhaustive and reinforcement learning-based search, the authors discover multiple activation functions and identify the best one, f(x) = x · sigmoid(βx), which they name Swish. Swish tends to work better than ReLU on deeper models across challenging datasets. Simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it straightforward for practitioners to adopt.

What carries the argument

The Swish activation function f(x) = x · sigmoid(βx), found by combining exhaustive search over simple expressions with reinforcement-learning-guided search over more complex candidates.
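As a minimal sketch of the definition above, the function can be written in plain Python. The names `sigmoid`, `swish`, and `relu` and the default `beta=1.0` are illustrative choices, not taken from the authors' code. Two limiting cases follow directly from the formula: β = 0 gives the linear function x/2, and a large β pushes the sigmoid toward a step, so Swish approaches ReLU.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish activation f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified linear unit, shown only for comparison."""
    return max(0.0, x)


# beta = 0 makes sigmoid(0) = 0.5, so swish(x, 0.0) equals x / 2.
half_linear = swish(2.0, beta=0.0)
# A large beta (50.0 is an arbitrary illustrative value) makes Swish track ReLU closely.
near_relu = swish(3.0, beta=50.0)
```

Because the function differs from ReLU only in how it treats inputs near and below zero, swapping it into an existing model typically means changing one function call.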

If this is right

Deeper networks show larger relative gains from Swish than shallower ones.
Swish can be dropped into any existing architecture in place of ReLU without other changes.
The same search procedure yields several other functions that also outperform ReLU in the reported experiments.
Practitioners can adopt Swish immediately because it requires no new hyperparameters beyond the single scalar beta.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The search framework could be reused to discover task-specific activation functions rather than a single universal one.
Because Swish is smooth and non-monotonic near zero, it may interact differently with gradient-based optimizers than ReLU does.
Similar automated search might be applied to other low-level choices such as normalization layers or loss functions.
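The non-monotonic shape mentioned above can be checked numerically. The sketch below assumes β = 1 and uses the derivative f'(x) = s + βx·s·(1 − s) with s = sigmoid(βx), which follows from the product rule. The helper names and the bisection bracket [-5, 0] are illustrative assumptions. For β = 1 the minimum sits near x ≈ −1.28, where the function value is about −0.28.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def swish(x: float, beta: float = 1.0) -> float:
    return x * sigmoid(beta * x)


def swish_grad(x: float, beta: float = 1.0) -> float:
    """d/dx [x * sigmoid(beta x)] = s + beta * x * s * (1 - s), s = sigmoid(beta x)."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


def swish_minimum(beta: float = 1.0, lo: float = -5.0, hi: float = 0.0) -> float:
    """Bisection on the derivative, which is negative at lo and positive at hi for beta = 1."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if swish_grad(mid, beta) < 0.0:
            lo = mid  # still on the decreasing side; move right
        else:
            hi = mid  # already increasing; move left
    return 0.5 * (lo + hi)


x_min = swish_minimum()
f_min = swish(x_min)
```

To the left of `x_min` the function decreases as x increases, and to the right it increases. That dip is what separates Swish from monotonic activations such as ReLU.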

Load-bearing premise

The performance gains come from intrinsic properties of the discovered functions rather than from interactions with the specific model architectures, training schedules, or hyperparameter settings used in the tests.

What would settle it

A controlled study that swaps Swish into a broad collection of models while holding all other training details fixed and finds no consistent accuracy improvement would show the advantage is not general.
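A hedged sketch of how such a paired comparison could be summarized: each ReLU run is matched with a Swish run that shares the same model, seed, and training recipe, and the per-pair accuracy differences are aggregated. The function names and the accuracy values are placeholders for illustration only. They are not results from the paper.

```python
from statistics import mean, stdev


def paired_deltas(relu_acc, swish_acc):
    """Per-run accuracy differences (Swish minus ReLU) for matched runs."""
    if len(relu_acc) != len(swish_acc):
        raise ValueError("runs must be paired one-to-one")
    return [s - r for r, s in zip(relu_acc, swish_acc)]


def summarize(deltas):
    """Mean delta, sample standard deviation, and fraction of pairs where Swish wins."""
    wins = sum(1 for d in deltas if d > 0)
    spread = stdev(deltas) if len(deltas) > 1 else 0.0
    return {"mean": mean(deltas), "stdev": spread, "win_rate": wins / len(deltas)}


# Placeholder numbers for illustration only; NOT results from the paper.
relu_runs = [70.0, 71.0, 69.5, 70.5]
swish_runs = [70.5, 71.0, 70.5, 70.0]
report = summarize(paired_deltas(relu_runs, swish_runs))
```

A mean delta close to zero with a win rate near one half across many models would be the kind of outcome the paragraph above describes as evidence against a general advantage.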

Original abstract

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9\% for Mobile NASNet-A and 0.6\% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Swish edges out ReLU by small margins on a couple of ImageNet models after automated search, but the gains rest on baselines that were never retuned for the new function.

The main thing to know is that this paper uses a mix of exhaustive search and reinforcement learning to hunt for activation functions and lands on Swish, f(x) = x · sigmoid(βx), which beats plain ReLU by 0.9% top-1 on Mobile NASNet-A and 0.6% on Inception-ResNet-v2 when you just drop it in. They also report similar small lifts on other datasets and deeper models. That search approach is genuinely new relative to the hand-designed options like Leaky ReLU or ELU that came before, and the resulting function is simple enough that practitioners could try it without changing much else in their code. The paper does a decent job of showing the search actually produces usable candidates and that Swish is not wildly different from ReLU in shape, which helps explain why it trains stably.

The soft spot is the comparison protocol. All runs keep the identical optimizer, learning-rate schedule, batch size, and initialization that were chosen for ReLU. There is no evidence they re-optimized those choices for Swish, so the reported deltas could shrink or vanish once each activation gets its own best training setup. The abstract and experiments do not include ablations on schedule sensitivity or statistical controls across multiple random seeds, which leaves the attribution to the functional form itself a bit shaky.

This work is aimed at people building or tuning deep vision models who are open to trying a drop-in replacement that might help a little at negligible cost. It is solid enough on the method side to deserve peer review, though any referee will want clearer controls on the training hyperparameters before accepting the performance claims at face value.

Referee Report

2 major / 3 minor

Summary. The paper proposes leveraging automatic search techniques, including exhaustive and reinforcement learning-based methods, to discover new activation functions for deep neural networks. The best discovered function, named Swish and defined as f(x) = x · sigmoid(βx), is empirically shown to outperform the standard ReLU activation on deeper models across challenging datasets, with specific improvements such as 0.9% top-1 accuracy gain on ImageNet for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 upon simple replacement.

Significance. If the empirical results are robust to hyperparameter choices, this work is significant in providing a simple yet effective alternative to ReLU that practitioners can easily adopt. The automated search approach offers a systematic alternative to hand-designed activations and demonstrates concrete gains on standard benchmarks like ImageNet, which is a strength of the manuscript.

major comments (2)

[§4] The experiments report accuracy improvements from replacing ReLU with Swish while retaining the identical optimizer, learning-rate schedule, batch size, and initialization chosen for ReLU. Without per-activation hyperparameter re-optimization or an ablation on schedule sensitivity, it remains unclear whether the reported gains (0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 on ImageNet) are caused by the intrinsic properties of Swish or by interactions with ReLU-tuned protocols.
[§3] The search space definition, number of trials, and statistical controls for both the exhaustive and RL-based searches are insufficiently detailed. This affects assessment of whether Swish was reliably identified as superior rather than selected post hoc from a large set of candidates.

minor comments (3)

The value of β used in the reported Swish experiments should be explicitly stated (fixed or learned) along with sensitivity analysis.
Additional citations to prior parametric activation functions (e.g., PReLU) would strengthen the related-work discussion.
Figures comparing activation functions would benefit from including their derivatives to illustrate effects on gradient flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [§4] The experiments report accuracy improvements from replacing ReLU with Swish while retaining the identical optimizer, learning-rate schedule, batch size, and initialization chosen for ReLU. Without per-activation hyperparameter re-optimization or an ablation on schedule sensitivity, it remains unclear whether the reported gains (0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 on ImageNet) are caused by the intrinsic properties of Swish or by interactions with ReLU-tuned protocols.

Authors: We agree that the hyperparameters and training schedules were those originally tuned for ReLU, and no per-activation re-optimization or schedule ablation was performed. This design choice was made to evaluate Swish as a drop-in replacement that requires no additional tuning effort from practitioners. The consistent gains across two distinct architectures (Mobile NASNet-A and Inception-ResNet-v2) and multiple datasets provide supporting evidence that the improvements are not solely due to protocol interactions. Nevertheless, we acknowledge the limitation noted by the referee. In the revised manuscript we will add an explicit discussion of this point in Section 4 and include a limited ablation study on learning-rate sensitivity for Swish versus ReLU using a smaller proxy task.

revision: partial

Referee: [§3] The search space definition, number of trials, and statistical controls for both the exhaustive and RL-based searches are insufficiently detailed. This affects assessment of whether Swish was reliably identified as superior rather than selected post hoc from a large set of candidates.

Authors: We thank the referee for highlighting the need for greater transparency in the search methodology. The original Section 3 described the overall approach at a high level but omitted precise specifications of the search space, exact trial counts, and statistical safeguards. In the revision we will expand Section 3 to enumerate the full set of unary and binary operations, the constants considered, the total number of functions evaluated in the exhaustive search, the RL training details (agent architecture, number of episodes, and reward formulation), and any repeated runs or variance metrics used to rank candidates. These additions will allow readers to evaluate the reliability of Swish's selection.

revision: yes
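To make the idea of an exhaustive search over unary and binary operations concrete, here is a minimal sketch. It assumes each candidate has the form binary(unary1(x), unary2(x)). The operator sets are deliberately tiny and hypothetical. They are not the search space used by the authors, which this page does not specify.

```python
import math
from itertools import product

# Hypothetical, deliberately small operator sets for illustration only.
UNARY = {
    "identity": lambda x: x,
    "neg": lambda x: -x,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": max,
}


def enumerate_candidates():
    """Yield (name, function) for every binary(unary1(x), unary2(x)) combination."""
    for (u1, f1), (u2, f2), (b, g) in product(
        UNARY.items(), UNARY.items(), BINARY.items()
    ):
        name = f"{b}({u1}(x), {u2}(x))"
        # Bind the current operators as defaults so each closure keeps its own.
        yield name, (lambda x, f1=f1, f2=f2, g=g: g(f1(x), f2(x)))


candidates = dict(enumerate_candidates())
swish_like = candidates["mul(identity(x), sigmoid(x))"]
```

With 4 unary and 3 binary operators this yields 4 × 4 × 3 = 48 candidates, one of which is x · sigmoid(x), the β = 1 case of Swish. In an actual search each candidate would be scored by training a small network with it, a step this sketch omits.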

Circularity Check

0 steps flagged

No circularity: empirical search discovery followed by independent benchmark evaluation

The paper's chain consists of (1) applying exhaustive and RL-based search over activation function spaces to identify candidates, (2) selecting the best performer f(x) = x · sigmoid(βx), and (3) measuring its accuracy when substituted into fixed architectures on held-out datasets such as ImageNet. None of these steps reduces a reported performance delta to a quantity defined by the same fitted parameters or search objective used to generate the candidate; the gains are direct empirical measurements on standard test sets under the paper's stated protocol. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and no equation equates a prediction to its own input by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that functions found by the described search procedure will generalize and that the reported accuracy deltas are attributable to the activation choice.

free parameters (1)

β
Scaling parameter inside the discovered Swish form; its value is part of the function definition and may have been chosen or optimized during search.
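If β is treated as a trainable parameter rather than a fixed constant (an assumption here; this page does not say which variant the reported experiments use), its gradient has a simple closed form: ∂f/∂β = x² · s · (1 − s) with s = sigmoid(βx). The sketch below checks that expression against a central finite difference. All function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def swish(x: float, beta: float) -> float:
    return x * sigmoid(beta * x)


def swish_grad_beta(x: float, beta: float) -> float:
    """Partial derivative of x * sigmoid(beta x) with respect to beta: x^2 * s * (1 - s)."""
    s = sigmoid(beta * x)
    return x * x * s * (1.0 - s)


def finite_difference(x: float, beta: float, eps: float = 1e-6) -> float:
    """Central difference in beta, used only to sanity-check the closed form."""
    return (swish(x, beta + eps) - swish(x, beta - eps)) / (2.0 * eps)


analytic = swish_grad_beta(1.5, 1.0)
numeric = finite_difference(1.5, 1.0)
```

The gradient is zero at x = 0 and never negative, so increasing β can only raise the output for a given input, and the effect is strongest for inputs of moderate magnitude.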

axioms (1)

domain assumption: The space of functions considered during search contains useful activation functions that transfer to held-out models and tasks.
Invoked when the authors treat the search output as a general recommendation rather than a one-off result.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear

Our experiments show that the best discovered activation function, f(x) = x · sigmoid(βx), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets.

Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
cs.LG 2026-05 unverdicted novelty 8.0

Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
KAN: Kolmogorov-Arnold Networks
cs.LG 2024-04 conditional novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
cs.LG 2023-12 unverdicted novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Neural Statistical Functions
cs.LG 2026-05 unverdicted novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
cs.LG 2026-04 unverdicted novelty 7.0

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...
Selectivity and Shape in the Design of Forward-Forward Goodness Functions
cs.LG 2026-03 unverdicted novelty 7.0

Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.
SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
cs.LG 2026-03 unverdicted novelty 7.0

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
cs.LG 2024-05 unverdicted novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 unverdicted novelty 6.0

DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 conditional novelty 6.0

DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
math.OC 2026-05 unverdicted novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Sparse MoE in FFN blocks redistributes computation to attention in small Transformers primarily due to architectural capacity reduction and partitioning, not learned router specialization.
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
cs.LG 2026-05 unverdicted novelty 6.0

MoMo uses Feature-Wise Linear Modulation and low-rank neural modulation to condition contrastive planning representations on user preferences while preserving inference efficiency and probability density ratios.
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
cs.LG 2026-05 unverdicted novelty 6.0

MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
cs.LG 2026-05 unverdicted novelty 6.0

MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
On the Blessing of Pre-training in Weak-to-Strong Generalization
cs.LG 2026-05 unverdicted novelty 6.0

Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
Competing nonlinearities, criticality, and order-to-chaos transition in deep networks
cond-mat.dis-nn 2026-05 unverdicted novelty 6.0

A statistical mixture of Tanh and Swish activations with critical mixing fraction p_c induces a continuous phase transition to scale-invariant signal propagation in deep networks while preserving smoothness.
Neural-network reconstruction of THz transmission spectra using electrically tunable AlGaN/GaN plasmonic-crystal analyzer
physics.optics 2026-05 unverdicted novelty 6.0

A feedforward neural network trained on synthetic data inverts voltage-dependent intensities from an electrically tunable AlGaN/GaN plasmonic analyzer to reconstruct THz spectra, achieving lower error than Tikhonov re...
Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics
cs.LG 2026-05 unverdicted novelty 6.0

EDL learns a transferable classification loss from unlimited synthetic data via evolutionary optimization and a ranking-consistency objective, serving as a competitive drop-in replacement for cross-entropy on CIFAR-10...
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL 2026-05 unverdicted novelty 6.0

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
Four-dimensional QCD equation of state from a quasi-parton model with physics-informed neural networks
nucl-th 2026-04 unverdicted novelty 6.0

A PINN-trained quasi-parton model reproduces lattice cumulants at vanishing chemical potentials and supplies a consistent four-dimensional QCD equation of state at finite densities.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
cs.LG 2026-04 unverdicted novelty 6.0

GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
A Complex-Valued Continuous-Variable Quantum Approximation Optimization Algorithm (CCV-QAOA)
quant-ph 2026-04 unverdicted novelty 6.0

CCV-QAOA is a new complex-valued continuous-variable variant of QAOA that solves real and complex multivariate optimization problems via a variational framework.
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
cs.LG 2026-04 unverdicted novelty 6.0

Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...
OTProf: estimating high-resolution profiles of optical turbulence ($C_n^2$) from reanalysis using deep learning
physics.ao-ph 2026-04 conditional novelty 6.0

Deep learning model OTProf generates high-resolution C_n² profiles from ERA5 reanalysis data and outperforms the Hufnagel-Valley model for vertical structure and integrated parameters like Fried parameter r_0 in the N...
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
cs.CL 2026-05 unverdicted novelty 5.0

Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices
math.NA 2026-05 unverdicted novelty 5.0

A neural network predicts sensitive pseudospectra regions from matrix features to accelerate computation on structured non-normal banded matrices while preserving accuracy in identifying those regions.
Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
cs.CV 2026-05 unverdicted novelty 5.0

LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
cs.AI 2026-05 unverdicted novelty 5.0

BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL 2026-05 unverdicted novelty 5.0

AGoQ cuts LLM training memory by up to 52% and speeds it up by 1.34x using tailored 4-bit activations and 8-bit gradients with special communication, matching baseline accuracy on LLaMA models.
GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories
cs.RO 2026-04 unverdicted novelty 5.0

GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks ...
Physics-informed neural networks for form-finding of unilateral membrane structures
cs.CE 2026-04 unverdicted novelty 5.0

PINNs with hard and soft boundary enforcement solve membrane form-finding PDEs to accuracy comparable with FEM, with hard-BC yielding smaller boundary errors.
ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
cs.LG 2026-04 unverdicted novelty 5.0

ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.
YOLOv4: Optimal Speed and Accuracy of Object Detection
cs.CV 2020-04 unverdicted novelty 5.0

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
Agentic Risk-Aware Set-Based Engineering Design
cs.AI 2026-04 unverdicted novelty 4.0

Multi-agent LLM system applies set-based design and Conditional Value-at-Risk to explore and risk-filter airfoil designs with human manager coordination.
GLU Variants Improve Transformer
cs.LG 2020-02 unverdicted novelty 4.0

Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance
econ.GN 2026-05 unverdicted novelty 3.0

The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.
Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
cs.CV 2026-05 unverdicted novelty 3.0

A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.
Cosmos World Foundation Model Platform for Physical AI
cs.CV 2025-01 unverdicted novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 37 Pith papers · 4 internal anchors

[1]

Learning activation functions to improve deep neural networks

Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830,

work page arXiv
[2]

Reinforcement learning for architecture search by network transformation

Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Reinforcement learning for architecture search by network transformation. arXiv preprint arXiv:1707.04873,

work page arXiv
[3]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289,

work page Pith review arXiv
[4]

Language Modeling with Gated Convolutional Networks

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083,

work page Pith review arXiv
[5]

RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779,

work page Pith review arXiv
[6]

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118,

work page arXiv
[7]

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400,

work page Pith review arXiv
[8]

HyperNetworks

David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106,

work page internal anchor Pith review arXiv
[9]

Gaussian Error Linear Units (GELUs)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770–778, 2016a. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–64...

work page internal anchor Pith review Pith/arXiv arXiv
[10]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

What is the best multi-stage architecture for object recognition?

Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision,

work page 2009
[12]

Self-normalizing neural networks

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515,

work page arXiv
[13]

Learnable pooling with context gating for video classification

Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905,

work page arXiv
[14]

Flexible rectified linear units for improving convolutional neural networks

Suo Qiu and Bolun Cai. Flexible rectified linear units for improving convolutional neural networks. arXiv preprint arXiv:1706.08098,

work page arXiv
[15]

Large-scale evolution of image classifiers

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041,

work page arXiv
[16]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Highway Networks

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387,

work page Pith review arXiv
[18]

Learning to reinforcement learn

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763,

work page Pith review arXiv
[19]

Empirical evaluation of rectified activations in convolutional network

Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853,

work page arXiv
[20]

Practical network blocks design with q-learning

Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning. arXiv preprint arXiv:1708.05552,

work page arXiv
[21]

Deep interest network for click-through rate prediction

Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Xingya Dai, Han Zhu, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. arXiv preprint arXiv:1706.06978,

work page arXiv
[22]

Learning transferable architectures for scalable image recognition

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012,

work page arXiv