Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

Iryna Gurevych; Paul Youssef; Steffen Eger

arxiv: 1901.02671 · v1 · pith:MZ5TCWMYnew · submitted 2019-01-09 · cs.CL · cs.LG · cs.NE

Steffen Eger, Paul Youssef, Iryna Gurevych

classification cs.CL cs.LG cs.NE

keywords activation, functions, tasks, across, been, competitors, deep, function

Activation functions play a crucial role in neural networks because they are the nonlinearities which have been attributed to the success story of deep learning. One of the currently most popular activation functions is ReLU, but several competitors have recently been proposed or 'discovered', including LReLU functions and swish. While most works compare newly proposed activation functions on few tasks (usually from image classification) and against few competitors (usually ReLU), we perform the first large-scale comparison of 21 activation functions across eight different NLP tasks. We find that a largely unknown activation function performs most stably across all tasks, the so-called penalized tanh function. We also show that it can successfully replace the sigmoid and tanh gates in LSTM cells, leading to a 2 percentage point (pp) improvement over the standard choices on a challenging NLP task.
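For orientation, the activation functions named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. The abstract gives no formulas, so the definitions below follow the forms commonly used in the literature, and the constants are assumptions: a leaky-ReLU slope of 0.01, swish without a trainable scaling parameter, and a penalized-tanh factor of 0.25 on the negative side.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x < 0, so exp(x) cannot overflow
    return e / (1.0 + e)


def relu(x: float) -> float:
    """ReLU: pass positive inputs, zero out the rest."""
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Leaky ReLU; the slope value is an assumed default."""
    return x if x > 0 else slope * x


def swish(x: float) -> float:
    """Swish in its parameter-free form, x * sigmoid(x)."""
    return x * sigmoid(x)


def penalized_tanh(x: float, penalty: float = 0.25) -> float:
    """Penalized tanh: tanh(x) for x > 0, a scaled-down tanh(x) otherwise.

    The 0.25 penalty is an assumed default, not taken from the abstract.
    """
    t = math.tanh(x)
    return t if x > 0 else penalty * t


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(value, relu(value), leaky_relu(value),
              swish(value), penalized_tanh(value))
```

Under these assumed definitions, penalized tanh stays bounded like tanh but gives negative inputs a smaller slope, which parallels how leaky ReLU treats negative inputs relative to the identity.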

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deep Learning for CSI Feedback Based on Superimposed Coding
cs.NI 2019-07 unverdicted novelty 5.0

A multi-task neural network recovers superimposed downlink CSI and uplink data sequences in FDD massive MIMO, improving CSI estimation over standalone SC while maintaining similar UL-US detection across varying SNR and PPC.