Swish-T : Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance

Jinha Kim; Unsang Park; Youngmin Seo

arxiv: 2407.01012 · v3 · submitted 2024-07-01 · 💻 cs.LG · cs.CV

Swish-T : Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance

Youngmin Seo , Jinha Kim , Unsang Park This is my paper

Pith reviewed 2026-05-23 23:02 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords Swish activationTanh biasactivation functionsneural networksimage classificationMNISTCIFAR-10

0 comments

The pith

Adding a Tanh bias to Swish produces activation variants that outperform the original on image classification benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Swish-T family by adding a Tanh bias term to the standard Swish activation function. This addition is intended to let the function accept a wider range of negative inputs during early training and to create a smoother non-monotonic shape. The authors introduce several variants and single out Swish-T_C as the main recommendation, while noting that Swish-T and Swish-T_B also perform well. Experiments on MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100 are presented to show the empirical gains. An ablation study further indicates that Swish-T_C works effectively even in its non-parametric form.

Core claim

Swish-T is obtained by adding a Tanh bias to the original Swish function, yielding a family of activation functions whose variants deliver higher accuracy than Swish across the tested models and datasets. Swish-T_C is advanced as the primary choice, with the other two variants also showing competitive results. The Tanh bias is described as enabling broader negative-value acceptance at the start of training and producing a smoother curve overall.

What carries the argument

The Swish-T activation, formed by adding a Tanh bias term to Swish, which alters the negative-region behavior to support wider acceptance of negative values during initial training stages.

If this is right

Swish-T_C can serve as a direct replacement for Swish in convolutional and other neural network models.
The family provides task-dependent advantages, with different variants suited to different datasets or architectures.
Swish-T_C retains strong performance when used without extra learnable parameters.
The modification applies across multiple standard image-classification benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar bias additions could be tested on other non-monotonic activations to adjust their negative-region behavior.
The smoother curve may reduce training instability in deeper networks, though this is not measured in the paper.
The approach suggests a simple way to extend existing activations without introducing new learnable parameters.

Load-bearing premise

Performance differences are caused by the Tanh bias addition rather than differences in hyper-parameter tuning, random seeds, or other unstated implementation choices.

What would settle it

Re-running the exact same model architectures and training schedules on the same datasets but with the original Swish function substituted for Swish-T_C and checking whether the reported accuracy gaps disappear.

Figures

Figures reproduced from arXiv: 2407.01012 by Jinha Kim, Unsang Park, Youngmin Seo.

**Figure 2.** Figure 2: Swish-TC, Swish activation function and first derivatives. (a) Swish-TC activation function with fixed alpha and beta. (b) The first derivatives with fixed alpha=0.5 and different betas. Beta controls how quickly the first derivative reaches the upper/lower asymptotes. (c) Alpha determines the upper/lower bounds of the first derivative. Swish offers several advantages: • Smooth Nonlinearity: Swish is a smo… view at source ↗

**Figure 3.** Figure 3: Train and test curves for ShuffleNetv2 (2.x) on the CIFAR100 dataset. This figure shows the comparison of [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Average training time for SENet-18 and DenseNet-121 on CIFAR-10 using a single GPU. (Performance [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification creates a family of Swish-T variants, each designed to excel in different tasks, showcasing specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T$_{\textbf{C}}$ function, while Swish-T and Swish-T$_{\textbf{B}}$, byproducts of Swish-T$_{\textbf{C}}$, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T$_{\textbf{C}}$ as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at https://github.com/ictseoyoungmin/Swish-T-pytorch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Swish-T adds a tanh bias to Swish and reports accuracy gains on image benchmarks, but single-run point estimates without variance or matched hyperparameter controls make the attribution unreliable.

read the letter

The paper defines Swish-T by adding a tanh bias term to the standard Swish activation, producing variants including Swish-T_C that they recommend. They also release PyTorch code and run an ablation on the non-parametric form. Those are the concrete pieces that are new: the specific functional forms and the public implementation. The experiments cover MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100 across a few models, which is standard for this kind of tweak. The ablation showing decent results without the extra parameters is a reasonable check. Beyond that the work is incremental; it does not derive the bias from first principles or prove any general property. The central weakness is the experimental reporting. The abstract and stress-test note give no indication of multiple random seeds, standard deviations, or identical hyperparameter search budgets for each activation function. Without those controls, any accuracy delta can be explained by optimization noise rather than the tanh addition. The paper does not appear to address this directly. This kind of activation tweak is mainly of interest to practitioners who routinely test small changes on vision datasets and want a ready-to-use variant with code. It does not contain enough new theory or robust evidence to justify referee time. I would not bring it to a reading group or cite it. Skip peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Swish-T family of activation functions obtained by adding a scaled Tanh bias term to the standard Swish function, yielding three variants (Swish-T, Swish-T_B, Swish-T_C). It claims that Swish-T_C (and to a lesser extent the other variants) empirically outperforms Swish on MNIST, Fashion-MNIST, SVHN, CIFAR-10 and CIFAR-100 across several CNN and MLP architectures, with an additional ablation showing that the non-parametric form of Swish-T_C remains competitive. Public code is provided.

Significance. A well-controlled demonstration that a simple, fixed modification to Swish yields consistent gains would be a modest but useful contribution to the activation-function literature. The current manuscript, however, supplies only single-run point estimates; therefore the practical significance cannot yet be assessed.

major comments (2)

[§4] §4 (Experiments) and all result tables: every reported accuracy is a single-run point estimate with no standard deviation, no multiple random seeds, and no statement that an identical hyper-parameter search budget was used for Swish versus each Swish-T variant. Because the central claim is empirical superiority, the absence of these controls makes it impossible to attribute observed deltas to the Tanh bias rather than optimization stochasticity.
[§4.2] §4.2 (Ablation study): the non-parametric Swish-T_C is compared only against the single-run Swish baseline; the same statistical-control issues therefore apply and the ablation does not rescue the main claim.

minor comments (2)

[§3] The definition of the scaling coefficients for the Tanh bias (Eq. 3–5) is clear, but the manuscript never states whether these coefficients are learned or fixed; a single clarifying sentence would remove ambiguity.
[Figure 1] Figure 1 caption should explicitly label the three curves as Swish-T, Swish-T_B and Swish-T_C rather than relying on the legend alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of statistical robustness in the experimental evaluation. We address the two major comments point-by-point below and commit to the necessary revisions.

read point-by-point responses

Referee: [§4] §4 (Experiments) and all result tables: every reported accuracy is a single-run point estimate with no standard deviation, no multiple random seeds, and no statement that an identical hyper-parameter search budget was used for Swish versus each Swish-T variant. Because the central claim is empirical superiority, the absence of these controls makes it impossible to attribute observed deltas to the Tanh bias rather than optimization stochasticity.

Authors: We agree that single-run point estimates are insufficient to support claims of consistent superiority. In the revised manuscript we will rerun all experiments using at least five independent random seeds, report mean accuracy and standard deviation for every table entry, and explicitly document that the hyper-parameter search budget and protocol were identical across Swish and all Swish-T variants. revision: yes
Referee: [§4.2] §4.2 (Ablation study): the non-parametric Swish-T_C is compared only against the single-run Swish baseline; the same statistical-control issues therefore apply and the ablation does not rescue the main claim.

Authors: The same limitation applies to the ablation study. We will repeat the non-parametric ablation with multiple random seeds and report means and standard deviations so that the ablation results are statistically comparable to the main experiments. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on direct benchmark comparisons

full rationale

The paper proposes Swish-T by adding a Tanh bias term to the Swish activation and asserts superiority via reported accuracies on MNIST, Fashion-MNIST, SVHN, CIFAR-10/100. No derivation, uniqueness theorem, fitted-parameter prediction, or self-citation chain is present; the central claim is a set of point-estimate experimental results rather than any quantity that reduces to the paper's own definitions or inputs by construction. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The proposal introduces a parametric bias term whose specific coefficients appear chosen to match observed performance; no independent derivation or external benchmark is supplied in the abstract.

free parameters (1)

Tanh bias scaling coefficients
The abstract describes a family of functions whose exact shape depends on bias parameters whose values are selected for each variant.

axioms (1)

standard math Standard mathematical definitions and derivatives of sigmoid and tanh functions hold.
Invoked implicitly when defining the new activation and its gradient.

pith-pipeline@v0.9.0 · 5736 in / 1104 out tokens · 29729 ms · 2026-05-23T23:02:26.804088+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

[1]

Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit

Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951, 2000

work page 2000
[2]

What is the best multi-stage 9 A PREPRINT - J ULY 4, 2024 architecture for object recognition? In 2009 IEEE 12th international conference on computer vision , pages 2146–2153

Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage 9 A PREPRINT - J ULY 4, 2024 architecture for object recognition? In 2009 IEEE 12th international conference on computer vision , pages 2146–2153. IEEE, 2009

work page 2024
[3]

Rectified linear units improve restricted boltzmann machines

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010

work page 2010
[4]

Maas, Awni Y

Andrew L. Maas, Awni Y . Hannun, and Andrew Y . Ng. Rectifier nonlinearities improve neural network acoustic models. In in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013

work page 2013
[5]

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015

work page 2015
[6]

Gaussian Error Linear Units (GELUs), 2023

Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs), 2023

work page 2023
[7]

Prajit Ramachandran, Barret Zoph, and Quoc V . Le. Searching for Activation Functions, 2017

work page 2017
[8]

Activate or Not: Learning Customized Activation

Ningning Ma, Xiangyu Zhang, Ming Liu, and Jian Sun. Activate or Not: Learning Customized Activation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2021

work page 2021
[9]

ErfAct and Pserf: Non- monotonic Smooth Trainable Activation Functions

Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, and Ashish Kumar Pandey. ErfAct and Pserf: Non- monotonic Smooth Trainable Activation Functions. Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6097–6105, June 2022

work page 2022
[10]

Mish: A Self Regularized Non-Monotonic Activation Function, 2020

Diganta Misra. Mish: A Self Regularized Non-Monotonic Activation Function, 2020

work page 2020
[11]

Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique

Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, and Ashish Kumar Pandey. Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 784–793, 2022

work page 2022
[12]

Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout Networks, 2013

work page 2013
[13]

E-swish: Adjusting Activations to Different Network Depths, 2018

Eric Alcaide. E-swish: Adjusting Activations to Different Network Depths, 2018

work page 2018
[14]

P-Swish: Activation Function with Learnable Parameters Based on Swish Activation Function in Deep Learning

Marina Adriana Mercioni and Stefan Holban. P-Swish: Activation Function with Learnable Parameters Based on Swish Activation Function in Deep Learning. In 2020 International Symposium on Electronics and Telecommuni- cations (ISETC), pages 1–4, 2020

work page 2020
[15]

Soft-Clipping Swish: A Novel Activation Function for Deep Learning

Marina Adriana Mercioni and Stefan Holban. Soft-Clipping Swish: A Novel Activation Function for Deep Learning. In 2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI), pages 225–230, 2021

work page 2021
[16]

Enhancement of Deep Learning in Image Classification Performance Using Xception with the Swish Activation Function for Colorectal Polyp Preliminary Screening

Natinai Jinsakul, Cheng-Fa Tsai, Chia-En Tsai, and Pensee Wu. Enhancement of Deep Learning in Image Classification Performance Using Xception with the Swish Activation Function for Colorectal Polyp Preliminary Screening. Mathematics, 7(12), 2019

work page 2019
[17]

Mingxing Tan, Ruoming Pang, and Quoc V . Le. EfficientDet: Scalable and Efficient Object Detection. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020

work page 2020
[18]

Skin Cancer Detection using CNN with Swish Activation Function

Misba Farheen, M Manjushree, and Manish Kumar Pandit. Skin Cancer Detection using CNN with Swish Activation Function. 2020

work page 2020
[19]

Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

Steffen Eger, Paul Youssef, and Iryna Gurevych. Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018

work page 2018
[20]

Bias Also Matters: Bias Attribution for Deep Neural Network Explanation

Shengjie Wang, Tianyi Zhou, and Jeff Bilmes. Bias Also Matters: Bias Attribution for Deep Neural Network Explanation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6659–6667. PMLR, 09–15 Jun 2019

work page 2019
[21]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009
[22]

Deep Residual Learning for Image Recognition, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2015

work page 2015
[23]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murra...

work page 2015
[24]

PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performa...

work page 2019
[25]

Beyond regression : new tools for prediction and analysis in the behavioral sciences /

Paul Werbos and Paul John. Beyond regression : new tools for prediction and analysis in the behavioral sciences /. 01 1974

work page 1974
[26]

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, November 2018

work page 2018
[27]

LeCun, B

Y . LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, 1989

work page 1989
[28]

MNIST handwritten digit database

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010

work page 2010
[29]

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Han Xiao, Kashif Rasul, and Roland V ollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Reading Digits in Natural Images with Unsupervised Feature Learning

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. 2011

work page 2011
[31]

Handwritten digit recognition with a back-propagation network.Advances in neural information processing systems, 2, 1989

Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network.Advances in neural information processing systems, 2, 1989

work page 1989
[32]

Robbins and S

H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951

work page 1951
[33]

Kiefer and J

J. Kiefer and J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression Function. Annals of Mathematical Statistics, 23:462–466, 1952

work page 1952
[34]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[35]

Shufflenet v2: Practical guidelines for efficient cnn architecture design

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018

work page 2018
[36]

Squeeze-and-excitation networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

work page 2018
[37]

Efficientnet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

work page 2019
[38]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018

work page 2018
[39]

Densely connected convolutional networks

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. 11

work page 2017

[1] [1]

Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit

Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951, 2000

work page 2000

[2] [2]

What is the best multi-stage 9 A PREPRINT - J ULY 4, 2024 architecture for object recognition? In 2009 IEEE 12th international conference on computer vision , pages 2146–2153

Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage 9 A PREPRINT - J ULY 4, 2024 architecture for object recognition? In 2009 IEEE 12th international conference on computer vision , pages 2146–2153. IEEE, 2009

work page 2024

[3] [3]

Rectified linear units improve restricted boltzmann machines

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010

work page 2010

[4] [4]

Maas, Awni Y

Andrew L. Maas, Awni Y . Hannun, and Andrew Y . Ng. Rectifier nonlinearities improve neural network acoustic models. In in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013

work page 2013

[5] [5]

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015

work page 2015

[6] [6]

Gaussian Error Linear Units (GELUs), 2023

Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs), 2023

work page 2023

[7] [7]

Prajit Ramachandran, Barret Zoph, and Quoc V . Le. Searching for Activation Functions, 2017

work page 2017

[8] [8]

Activate or Not: Learning Customized Activation

Ningning Ma, Xiangyu Zhang, Ming Liu, and Jian Sun. Activate or Not: Learning Customized Activation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2021

work page 2021

[9] [9]

ErfAct and Pserf: Non- monotonic Smooth Trainable Activation Functions

Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, and Ashish Kumar Pandey. ErfAct and Pserf: Non- monotonic Smooth Trainable Activation Functions. Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6097–6105, June 2022

work page 2022

[10] [10]

Mish: A Self Regularized Non-Monotonic Activation Function, 2020

Diganta Misra. Mish: A Self Regularized Non-Monotonic Activation Function, 2020

work page 2020

[11] [11]

Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique

Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, and Ashish Kumar Pandey. Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 784–793, 2022

work page 2022

[12] [12]

Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout Networks, 2013

work page 2013

[13] [13]

E-swish: Adjusting Activations to Different Network Depths, 2018

Eric Alcaide. E-swish: Adjusting Activations to Different Network Depths, 2018

work page 2018

[14] [14]

P-Swish: Activation Function with Learnable Parameters Based on Swish Activation Function in Deep Learning

Marina Adriana Mercioni and Stefan Holban. P-Swish: Activation Function with Learnable Parameters Based on Swish Activation Function in Deep Learning. In 2020 International Symposium on Electronics and Telecommuni- cations (ISETC), pages 1–4, 2020

work page 2020

[15] [15]

Soft-Clipping Swish: A Novel Activation Function for Deep Learning

Marina Adriana Mercioni and Stefan Holban. Soft-Clipping Swish: A Novel Activation Function for Deep Learning. In 2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI), pages 225–230, 2021

work page 2021

[16] [16]

Enhancement of Deep Learning in Image Classification Performance Using Xception with the Swish Activation Function for Colorectal Polyp Preliminary Screening

Natinai Jinsakul, Cheng-Fa Tsai, Chia-En Tsai, and Pensee Wu. Enhancement of Deep Learning in Image Classification Performance Using Xception with the Swish Activation Function for Colorectal Polyp Preliminary Screening. Mathematics, 7(12), 2019

work page 2019

[17] [17]

Mingxing Tan, Ruoming Pang, and Quoc V . Le. EfficientDet: Scalable and Efficient Object Detection. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020

work page 2020

[18] [18]

Skin Cancer Detection using CNN with Swish Activation Function

Misba Farheen, M Manjushree, and Manish Kumar Pandit. Skin Cancer Detection using CNN with Swish Activation Function. 2020

work page 2020

[19] [19]

Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

Steffen Eger, Paul Youssef, and Iryna Gurevych. Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018

work page 2018

[20] [20]

Bias Also Matters: Bias Attribution for Deep Neural Network Explanation

Shengjie Wang, Tianyi Zhou, and Jeff Bilmes. Bias Also Matters: Bias Attribution for Deep Neural Network Explanation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6659–6667. PMLR, 09–15 Jun 2019

work page 2019

[21] [21]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009

[22] [22]

Deep Residual Learning for Image Recognition, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2015

work page 2015

[23] [23]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murra...

work page 2015

[24] [24]

PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performa...

work page 2019

[25] [25]

Beyond regression : new tools for prediction and analysis in the behavioral sciences /

Paul Werbos and Paul John. Beyond regression : new tools for prediction and analysis in the behavioral sciences /. 01 1974

work page 1974

[26] [26]

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, November 2018

work page 2018

[27] [27]

LeCun, B

Y . LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, 1989

work page 1989

[28] [28]

MNIST handwritten digit database

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010

work page 2010

[29] [29]

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Han Xiao, Kashif Rasul, and Roland V ollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

Reading Digits in Natural Images with Unsupervised Feature Learning

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. 2011

work page 2011

[31] [31]

Handwritten digit recognition with a back-propagation network.Advances in neural information processing systems, 2, 1989

Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network.Advances in neural information processing systems, 2, 1989

work page 1989

[32] [32]

Robbins and S

H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951

work page 1951

[33] [33]

Kiefer and J

J. Kiefer and J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression Function. Annals of Mathematical Statistics, 23:462–466, 1952

work page 1952

[34] [34]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[35] [35]

Shufflenet v2: Practical guidelines for efficient cnn architecture design

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018

work page 2018

[36] [36]

Squeeze-and-excitation networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

work page 2018

[37] [37]

Efficientnet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

work page 2019

[38] [38]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018

work page 2018

[39] [39]

Densely connected convolutional networks

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. 11

work page 2017