arxiv: 2407.01012 · v3 · submitted 2024-07-01 · 💻 cs.LG · cs.CV

Swish-T : Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance

Pith reviewed 2026-05-23 23:02 UTC · model grok-4.3

keywords Swish activation · Tanh bias · activation functions · neural networks · image classification · MNIST · CIFAR-10

The pith

Adding a Tanh bias to Swish produces activation variants that outperform the original on image classification benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Swish-T family by adding a Tanh bias term to the standard Swish activation function. This addition is intended to let the function accept a wider range of negative inputs during early training and to create a smoother non-monotonic shape. The authors introduce several variants and single out Swish-T_C as the main recommendation, while noting that Swish-T and Swish-T_B also perform well. Experiments on MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100 are presented to show the empirical gains. An ablation study further indicates that Swish-T_C works effectively even in its non-parametric form.

Core claim

Swish-T is obtained by adding a Tanh bias to the original Swish function, yielding a family of activation functions whose variants deliver higher accuracy than Swish across the tested models and datasets. Swish-T_C is advanced as the primary choice, with the other two variants also showing competitive results. The Tanh bias is described as enabling broader negative-value acceptance at the start of training and producing a smoother curve overall.

What carries the argument

The Swish-T activation, formed by adding a Tanh bias term to Swish, which alters the negative-region behavior to support wider acceptance of negative values during initial training stages.
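
A minimal sketch of what "adding a Tanh bias term to Swish" could look like. The exact definitions of Swish-T, Swish-T_B and Swish-T_C are given in the paper (Eq. 3–5) and are not reproduced on this page, so the additive form `x * sigmoid(beta * x) + alpha * tanh(x)`, the function name `swish_plus_tanh`, and the default `alpha = 0.1` are all assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Standard Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_plus_tanh(x: float, beta: float = 1.0, alpha: float = 0.1) -> float:
    """Swish with an additive scaled Tanh term.

    ASSUMED form for illustration; the paper's variants may scale or
    parameterize the Tanh term differently.
    """
    return swish(x, beta) + alpha * math.tanh(x)


if __name__ == "__main__":
    # Compare both functions at a few inputs, including the negative region.
    for x in (-6.0, -2.0, 0.0, 2.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  "
              f"swish+tanh={swish_plus_tanh(x):+.4f}")
```

Under this assumed form, the output for large negative inputs approaches `-alpha` instead of 0, which is one way to read the description of "wider acceptance of negative values". Whether the paper's variants behave the same way should be checked against its equations.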

If this is right

  • Swish-T_C can serve as a direct replacement for Swish in convolutional and other neural network models.
  • The family provides task-dependent advantages, with different variants suited to different datasets or architectures.
  • Swish-T_C retains strong performance when used without extra learnable parameters.
  • The modification applies across multiple standard image-classification benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar bias additions could be tested on other non-monotonic activations to adjust their negative-region behavior.
  • The smoother curve may reduce training instability in deeper networks, though this is not measured in the paper.
  • The approach suggests a simple way to extend existing activations without introducing new learnable parameters.

Load-bearing premise

Performance differences are caused by the Tanh bias addition rather than differences in hyper-parameter tuning, random seeds, or other unstated implementation choices.

What would settle it

Re-running the exact same model architectures and training schedules on the same datasets with only the activation swapped between Swish-T_C and the original Swish, then checking whether the reported accuracy gaps persist or disappear.
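
The controlled swap described above can be sketched as a small harness in which every activation is run with the same seeds, so that paired runs differ only in the activation. Everything here is hypothetical: `run_paired`, `paired_gaps`, and the activation labels are illustrative names, and `dummy_train_and_eval` is a stand-in that returns seed-dependent noise rather than training a model.

```python
import random
from typing import Callable, Dict, List, Sequence


def run_paired(
    train_and_eval: Callable[[str, int], float],
    activations: Sequence[str],
    seeds: Sequence[int],
) -> Dict[str, List[float]]:
    """Run every activation with the same seeds, in the same order."""
    results: Dict[str, List[float]] = {name: [] for name in activations}
    for seed in seeds:
        for name in activations:
            results[name].append(train_and_eval(name, seed))
    return results


def paired_gaps(
    results: Dict[str, List[float]], baseline: str, candidate: str
) -> List[float]:
    """Per-seed accuracy gap (candidate minus baseline)."""
    return [c - b for b, c in zip(results[baseline], results[candidate])]


def dummy_train_and_eval(activation: str, seed: int) -> float:
    """Stand-in for a real training run (hypothetical).

    It deliberately ignores `activation`, so any gap it produced would be
    pure seed noise; replace it with an actual train/evaluate routine.
    """
    rng = random.Random(seed)
    return rng.random()


if __name__ == "__main__":
    res = run_paired(dummy_train_and_eval, ["swish", "swish_t_c"], seeds=range(5))
    print(paired_gaps(res, "swish", "swish_t_c"))
```

Because the stand-in ignores the activation, every paired gap is zero. With a real training routine, gaps that stay consistently positive across seeds would support the paper's claim, while gaps that scatter around zero would not.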

Figures

Figures reproduced from arXiv: 2407.01012 by Jinha Kim, Unsang Park, Youngmin Seo.

Figure 1: Comparison of Various Activation Functions including Sigmoid, ReLU, Leaky ReLU, GELU, Swish, Mish, …

Figure 2: Swish-T_C, Swish activation function and first derivatives. (a) Swish-T_C activation function with fixed alpha and beta. (b) The first derivatives with fixed alpha=0.5 and different betas. Beta controls how quickly the first derivative reaches the upper/lower asymptotes. (c) Alpha determines the upper/lower bounds of the first derivative.

Figure 3: Train and test curves for ShuffleNetv2 (2.x) on the CIFAR100 dataset. This figure shows the comparison of …

Figure 4: Average training time for SENet-18 and DenseNet-121 on CIFAR-10 using a single GPU. (Performance …
Original abstract

We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification creates a family of Swish-T variants, each designed to excel in different tasks, showcasing specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T$_{\textbf{C}}$ function, while Swish-T and Swish-T$_{\textbf{B}}$, byproducts of Swish-T$_{\textbf{C}}$, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T$_{\textbf{C}}$ as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at https://github.com/ictseoyoungmin/Swish-T-pytorch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Swish-T family of activation functions obtained by adding a scaled Tanh bias term to the standard Swish function, yielding three variants (Swish-T, Swish-T_B, Swish-T_C). It claims that Swish-T_C (and to a lesser extent the other variants) empirically outperforms Swish on MNIST, Fashion-MNIST, SVHN, CIFAR-10 and CIFAR-100 across several CNN and MLP architectures, with an additional ablation showing that the non-parametric form of Swish-T_C remains competitive. Public code is provided.

Significance. A well-controlled demonstration that a simple, fixed modification to Swish yields consistent gains would be a modest but useful contribution to the activation-function literature. The current manuscript, however, supplies only single-run point estimates; therefore the practical significance cannot yet be assessed.

major comments (2)
  1. [§4] §4 (Experiments) and all result tables: every reported accuracy is a single-run point estimate with no standard deviation, no multiple random seeds, and no statement that an identical hyper-parameter search budget was used for Swish versus each Swish-T variant. Because the central claim is empirical superiority, the absence of these controls makes it impossible to attribute observed deltas to the Tanh bias rather than optimization stochasticity.
  2. [§4.2] §4.2 (Ablation study): the non-parametric Swish-T_C is compared only against the single-run Swish baseline; the same statistical-control issues therefore apply and the ablation does not rescue the main claim.
minor comments (2)
  1. [§3] The definition of the scaling coefficients for the Tanh bias (Eq. 3–5) is clear, but the manuscript never states whether these coefficients are learned or fixed; a single clarifying sentence would remove ambiguity.
  2. [Figure 1] Figure 1 caption should explicitly label the three curves as Swish-T, Swish-T_B and Swish-T_C rather than relying on the legend alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of statistical robustness in the experimental evaluation. We address the two major comments point-by-point below and commit to the necessary revisions.

  1. Referee: [§4] §4 (Experiments) and all result tables: every reported accuracy is a single-run point estimate with no standard deviation, no multiple random seeds, and no statement that an identical hyper-parameter search budget was used for Swish versus each Swish-T variant. Because the central claim is empirical superiority, the absence of these controls makes it impossible to attribute observed deltas to the Tanh bias rather than optimization stochasticity.

    Authors: We agree that single-run point estimates are insufficient to support claims of consistent superiority. In the revised manuscript we will rerun all experiments using at least five independent random seeds, report mean accuracy and standard deviation for every table entry, and explicitly document that the hyper-parameter search budget and protocol were identical across Swish and all Swish-T variants. revision: yes

  2. Referee: [§4.2] §4.2 (Ablation study): the non-parametric Swish-T_C is compared only against the single-run Swish baseline; the same statistical-control issues therefore apply and the ablation does not rescue the main claim.

    Authors: The same limitation applies to the ablation study. We will repeat the non-parametric ablation with multiple random seeds and report means and standard deviations so that the ablation results are statistically comparable to the main experiments. revision: yes
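
The reporting the authors commit to (mean and standard deviation over several seeds) can be sketched with the Python standard library. The helper name `summarize` and the output format are assumptions, and the numbers in the usage block are placeholders for illustration only, not results from the paper.

```python
import statistics
from typing import Sequence


def summarize(accuracies: Sequence[float]) -> str:
    """Format per-seed accuracies as 'mean ± sample std (n=...)'."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample std (n - 1 denominator)
    return f"{mean:.2f} ± {std:.2f} (n={len(accuracies)})"


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT results from the paper.
    placeholder = {
        "activation_a": [90.0, 91.0, 92.0],
        "activation_b": [91.0, 91.0, 91.0],
    }
    for name, accs in placeholder.items():
        print(name, summarize(accs))
```

Reporting the sample standard deviation next to each mean lets a reader judge whether a gap between two activations is larger than the run-to-run spread.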

Circularity Check

0 steps flagged

No circularity; empirical claims rest on direct benchmark comparisons

The paper proposes Swish-T by adding a Tanh bias term to the Swish activation and asserts superiority via reported accuracies on MNIST, Fashion-MNIST, SVHN, CIFAR-10/100. No derivation, uniqueness theorem, fitted-parameter prediction, or self-citation chain is present; the central claim is a set of point-estimate experimental results rather than any quantity that reduces to the paper's own definitions or inputs by construction. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The proposal introduces a parametric bias term whose specific coefficients appear to be chosen to match observed performance; no independent derivation or external benchmark is supplied in the abstract.

free parameters (1)
  • Tanh bias scaling coefficients
    The abstract describes a family of functions whose exact shape depends on bias parameters whose values are selected for each variant.
axioms (1)
  • [standard math] Standard mathematical definitions and derivatives of sigmoid and tanh functions hold.
    Invoked implicitly when defining the new activation and its gradient.
