Swish-T : Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance
Pith reviewed 2026-05-23 23:02 UTC · model grok-4.3
The pith
Adding a Tanh bias to Swish produces activation variants that outperform the original on image classification benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Swish-T is obtained by adding a Tanh bias to the original Swish function, yielding a family of activation functions whose variants deliver higher accuracy than Swish across the tested models and datasets. Swish-T_C is advanced as the primary choice, with the other two variants also showing competitive results. The Tanh bias is described as enabling broader negative-value acceptance at the start of training and producing a smoother curve overall.
What carries the argument
The Swish-T activation, formed by adding a Tanh bias term to Swish, which alters the negative-region behavior to support wider acceptance of negative values during initial training stages.
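As a rough illustration of the mechanism described above, the sketch below adds a scaled tanh term to Swish. The review does not reproduce the paper's equations, so the additive form `x * sigmoid(beta * x) + alpha * tanh(x)`, the function names, and the default `alpha = 0.1` are assumptions made here for illustration and are not taken from the source.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Standard Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_t(x: float, beta: float = 1.0, alpha: float = 0.1) -> float:
    """Swish plus a scaled tanh term.

    ASSUMPTION: the additive form and the default alpha are illustrative;
    the exact variants are defined in the paper, not in this review.
    """
    return swish(x, beta) + alpha * math.tanh(x)


if __name__ == "__main__":
    # Print both activations side by side over a few sample inputs.
    for x in (-4.0, -2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  swish_t={swish_t(x):+.4f}")
```

Under this assumed form with a positive `alpha`, the output for strongly negative inputs tends toward `-alpha` rather than toward zero, which is one way to read the "broader negative-value acceptance" claim.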
If this is right
- Swish-T_C can serve as a direct replacement for Swish in convolutional and other neural network models.
- The family provides task-dependent advantages, with different variants suited to different datasets or architectures.
- Swish-T_C retains strong performance when used without extra learnable parameters.
- The modification applies across multiple standard image-classification benchmarks.
Where Pith is reading between the lines
- Similar bias additions could be tested on other non-monotonic activations to adjust their negative-region behavior.
- The smoother curve may reduce training instability in deeper networks, though this is not measured in the paper.
- The approach suggests a simple way to extend existing activations without introducing new learnable parameters.
Load-bearing premise
Performance differences are caused by the Tanh bias addition rather than differences in hyper-parameter tuning, random seeds, or other unstated implementation choices.
What would settle it
Re-running the exact same model architectures and training schedules on the same datasets but with the original Swish function substituted for Swish-T_C and checking whether the reported accuracy gaps disappear.
Original abstract
We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification creates a family of Swish-T variants, each designed to excel in different tasks, showcasing specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T$_{\textbf{C}}$ function, while Swish-T and Swish-T$_{\textbf{B}}$, byproducts of Swish-T$_{\textbf{C}}$, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T$_{\textbf{C}}$ as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at https://github.com/ictseoyoungmin/Swish-T-pytorch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Swish-T family of activation functions obtained by adding a scaled Tanh bias term to the standard Swish function, yielding three variants (Swish-T, Swish-T_B, Swish-T_C). It claims that Swish-T_C (and to a lesser extent the other variants) empirically outperforms Swish on MNIST, Fashion-MNIST, SVHN, CIFAR-10 and CIFAR-100 across several CNN and MLP architectures, with an additional ablation showing that the non-parametric form of Swish-T_C remains competitive. Public code is provided.
Significance. A well-controlled demonstration that a simple, fixed modification to Swish yields consistent gains would be a modest but useful contribution to the activation-function literature. The current manuscript, however, supplies only single-run point estimates; therefore the practical significance cannot yet be assessed.
major comments (2)
- [§4] §4 (Experiments) and all result tables: every reported accuracy is a single-run point estimate with no standard deviation, no multiple random seeds, and no statement that an identical hyper-parameter search budget was used for Swish versus each Swish-T variant. Because the central claim is empirical superiority, the absence of these controls makes it impossible to attribute observed deltas to the Tanh bias rather than optimization stochasticity.
- [§4.2] §4.2 (Ablation study): the non-parametric Swish-T_C is compared only against the single-run Swish baseline; the same statistical-control issues therefore apply and the ablation does not rescue the main claim.
minor comments (2)
- [§3] The definition of the scaling coefficients for the Tanh bias (Eq. 3–5) is clear, but the manuscript never states whether these coefficients are learned or fixed; a single clarifying sentence would remove ambiguity.
- [Figure 1] Figure 1 caption should explicitly label the three curves as Swish-T, Swish-T_B and Swish-T_C rather than relying on the legend alone.
Simulated Author's Rebuttal
We thank the referee for highlighting the importance of statistical robustness in the experimental evaluation. We address the two major comments point-by-point below and commit to the necessary revisions.
- Referee: [§4] §4 (Experiments) and all result tables: every reported accuracy is a single-run point estimate with no standard deviation, no multiple random seeds, and no statement that an identical hyper-parameter search budget was used for Swish versus each Swish-T variant. Because the central claim is empirical superiority, the absence of these controls makes it impossible to attribute observed deltas to the Tanh bias rather than optimization stochasticity.
  Authors: We agree that single-run point estimates are insufficient to support claims of consistent superiority. In the revised manuscript we will rerun all experiments using at least five independent random seeds, report mean accuracy and standard deviation for every table entry, and explicitly document that the hyper-parameter search budget and protocol were identical across Swish and all Swish-T variants.
  Revision: yes
- Referee: [§4.2] §4.2 (Ablation study): the non-parametric Swish-T_C is compared only against the single-run Swish baseline; the same statistical-control issues therefore apply and the ablation does not rescue the main claim.
  Authors: The same limitation applies to the ablation study. We will repeat the non-parametric ablation with multiple random seeds and report means and standard deviations so that the ablation results are statistically comparable to the main experiments.
  Revision: yes
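The multi-seed reporting promised in the responses above could be summarized along the following lines. This is a minimal sketch: the helper names (`summarize`, `paired_gap`, `format_entry`) are hypothetical, and the numbers in the usage section are placeholders chosen for readability, not results from the paper.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


def paired_gap(baseline, candidate):
    """Mean and standard deviation of per-seed differences (candidate - baseline).

    Assumes run i of both lists used the same seed and training protocol.
    """
    if len(baseline) != len(candidate):
        raise ValueError("both activations need the same number of seeded runs")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    return mean(diffs), stdev(diffs)


def format_entry(accuracies):
    """Format one table cell as 'mean ± std' with two decimals."""
    m, s = summarize(accuracies)
    return f"{m:.2f} ± {s:.2f}"


if __name__ == "__main__":
    # PLACEHOLDER values only; these are not measurements from the paper.
    swish_runs = [90.0, 92.0, 94.0]
    swish_t_runs = [91.0, 93.0, 95.0]
    print("Swish:  ", format_entry(swish_runs))
    print("Swish-T:", format_entry(swish_t_runs))
    print("Paired gap (mean, std):", paired_gap(swish_runs, swish_t_runs))
```

Reporting the paired gap alongside the per-activation means makes it easier to judge whether a difference exceeds the seed-to-seed spread, which is the concern the referee raises.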
Circularity Check
No circularity; empirical claims rest on direct benchmark comparisons
The paper proposes Swish-T by adding a Tanh bias term to the Swish activation and asserts superiority via reported accuracies on MNIST, Fashion-MNIST, SVHN, CIFAR-10/100. No derivation, uniqueness theorem, fitted-parameter prediction, or self-citation chain is present; the central claim is a set of point-estimate experimental results rather than any quantity that reduces to the paper's own definitions or inputs by construction. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Tanh bias scaling coefficients
axioms (1)
- standard math: Standard mathematical definitions and derivatives of sigmoid and tanh functions hold.
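To show how this axiom would be used, the block below differentiates an additive Swish-plus-tanh activation. The form $f(x) = x\,\sigma(\beta x) + \alpha \tanh(x)$ is an assumption made here for illustration; the review does not reproduce the paper's equations, and the symbols $\alpha$ and $\beta$ are likewise assumed names for the scaling coefficients.

```latex
% Standard identities (the axiom above):
%   \sigma'(z) = \sigma(z)\,(1 - \sigma(z)), \qquad \tanh'(x) = 1 - \tanh^2(x)
%
% Assumed activation: f(x) = x\,\sigma(\beta x) + \alpha \tanh(x)
\begin{align}
f'(x) &= \sigma(\beta x)
        + \beta x\,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr)
        + \alpha\bigl(1 - \tanh^2(x)\bigr), \\
f'(0) &= \tfrac{1}{2} + \alpha .
\end{align}
```

Under this assumed form, the tanh term raises the slope at the origin from $1/2$ (plain Swish) to $1/2 + \alpha$, and its contribution vanishes as $|x|$ grows.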