Deep Learning using Rectified Linear Units (ReLU)

Abien Fred Agarap

arXiv: 1803.08375 · v3 · submitted 2018-03-22 · cs.NE · cs.CV · cs.LG · stat.ML


keywords: relu · classification · functions · deep · image · linear · tanh · tasks


The Rectified Linear Unit (ReLU) is a foundational activation function in artificial neural networks. Recent literature frequently misattributes its origin to the initial (2018) version of this paper, which investigated ReLU exclusively at the classification layer. This paper formally corrects the citation record by tracing the mathematical lineage of piecewise linear functions from early biological models to their definitive integration into deep learning by Nair & Hinton (2010). Alongside this historical rectification, we present a comprehensive empirical comparison of the ReLU, Hyperbolic Tangent (Tanh), and Logistic (Sigmoid) activation functions across image classification, text classification, and image reconstruction tasks. To ensure statistical robustness, we evaluated these functions over 10 independent randomized trials and assessed significance using the non-parametric Kruskal-Wallis $H$ test. The empirical data validate the theoretical limitations of saturating functions: Sigmoid failed to converge in deep convolutional vision tasks due to the vanishing gradient problem, yielding accuracies equivalent to random chance. Conversely, ReLU and Tanh exhibited stable convergence. ReLU achieved the highest mean accuracy and F1-score on the image classification and text classification tasks, while Tanh yielded the highest peak signal-to-noise ratio in image reconstruction. Ultimately, this study confirms a statistically significant performance difference among activations, reaffirming the necessity of non-saturating functions in deep architectures, and restores proper historical attribution to prior literature.
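The saturation argument in the abstract can be illustrated with a minimal sketch. It is not the paper's experimental code. It only compares the local derivatives of the three activations and multiplies them across layers, ignoring weight matrices entirely. The helper name `chained_gradient` and the chosen pre-activation and depth values are assumptions made for illustration.

```python
import math


def relu(x: float) -> float:
    """ReLU: identity for positive inputs, zero otherwise."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU (taking the value 0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; its maximum is 0.25 at x == 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    """Derivative of tanh; its maximum is 1.0 at x == 0."""
    return 1.0 - math.tanh(x) ** 2


def chained_gradient(grad_fn, pre_activation: float, depth: int) -> float:
    """Hypothetical helper: product of `depth` identical local derivatives.

    Weights are ignored, so this isolates the activation's contribution only.
    """
    return grad_fn(pre_activation) ** depth


if __name__ == "__main__":
    # Illustrative values only: pre-activation 0.5, ten stacked layers.
    for name, fn in [("relu", relu_grad), ("tanh", tanh_grad), ("sigmoid", sigmoid_grad)]:
        print(name, chained_gradient(fn, 0.5, depth=10))
```

Even in the sigmoid's best case (pre-activation 0, derivative 0.25), ten stacked layers scale the gradient by 0.25^10, roughly 9.5e-7, whereas ReLU passes a factor of exactly 1 for any positive pre-activation. Tanh also saturates for large inputs, but its derivative reaches 1 near zero, which is consistent with the abstract's report that Tanh still converged.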


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning
quant-ph 2026-05 unverdicted novelty 7.0

QAP-Router models qubit routing as dynamic QAP and applies RL with a solution-aware Transformer to cut CNOT counts by 12-30% versus industry compilers on real circuit benchmarks.
Galaxy clusters in the LoTSS-DR3: Catalogues and detection pipeline for diffuse radio emission
astro-ph.CO 2026-05 unverdicted novelty 7.0

A Radio U-Net pipeline produces pixel-level segmentation maps and probability scores for diffuse radio emission in 3822 galaxy clusters from LoTSS-DR3, yielding a high-confidence sample of 357 and confirming trends wi...
Machine learning isotope shifts in molecular energy levels
astro-ph.EP 2026-04 unverdicted novelty 7.0

Neural network corrects residual errors in isotopologue energy extrapolations for CO2 (MAE reduction in >87% of levels vs Marvel) and transfers patterns to improve CO predictions in >93% of samples.
Winner-Take-All Spiking Transformer for Language Modeling
cs.NE 2026-04 unverdicted novelty 7.0

Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
One-Step Score-Based Density Ratio Estimation
stat.ML 2026-04 unverdicted novelty 7.0

OS-DRE performs score-based density ratio estimation in one step by approximating the temporal score component with a closed-form RBF frame and providing error bounds from approximation theory.
Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters
cs.AI 2026-04 unverdicted novelty 7.0

A hybrid neural policy operating in impulse space enables physics-based characters to track exaggerated, dynamically infeasible motions that standard DRL methods cannot stabilize.
Two-stage Convolutional Neural Network for pseudo six-dimensional phase space reconstruction
hep-ex 2026-03 unverdicted novelty 7.0

A two-stage CNN reconstructs pseudo 6D phase space from 16 x-y images taken at varying rotation angles in the KEK-ATF injector.
PILL-CoDe: Inverse Design of Polypills via Automatic Differentiation for Prescribed Drug-Release Kinetics
cs.CE 2025-12 conditional novelty 7.0

PILL-CoDe co-optimizes polypill geometry via supershapes and excipient maps via neural networks to match target drug-release curves using end-to-end differentiable modified Allen-Cahn and Fickian diffusion models.
Deep Learning for CMB Foreground Removal and Beam Deconvolution: A U-Net GAN Approach
astro-ph.IM 2025-08 unverdicted novelty 7.0

A U-Net GAN reconstructs CMB T and E maps from Planck-like simulations with foregrounds and systematics, achieving under 1% error outside the Galactic region and demonstrating first-time correction for non-circular be...
Moonwalk: Inverse-Forward Differentiation
cs.LG 2024-02 unverdicted novelty 7.0

Moonwalk enables memory-efficient training of deep networks via mixed-mode gradient computation with vector-inverse-Jacobian products for submersive layers and fragmental checkpointing otherwise, matching backprop run...
Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning
cs.CL 2026-05 unverdicted novelty 6.0

A distributional alignment metric d_NTP and a linear regression method LTV for task vectors that improves accuracy by 9.2% over baselines on classification and regression tasks across multiple LLMs.
Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains
cs.CL 2026-05 unverdicted novelty 6.0

TopoPrior learns transferable topology priors offline from multi-domain reference graphs using a conditional variational graph model and adversarial adaptation to initialize collaboration structures for multi-agent LL...
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
cs.LG 2026-04 unverdicted novelty 6.0

GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
cs.CV 2026-04 unverdicted novelty 6.0

Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
Force Field-Agnostic Phase Classification of Zeolitic Imidazolate Framework Polymorphs
cond-mat.mtrl-sci 2026-04 unverdicted novelty 6.0

Neural networks trained on molecular configurations from different force fields classify ZIF polymorph phases accurately in simulations and expose transition mechanisms without force-field bias.
Accelerating 4D Hyperspectral Imaging through Physics-Informed Neural Representation and Adaptive Sampling
eess.IV 2026-04 unverdicted novelty 6.0

A physics-informed MLP reconstructs high-fidelity 4D spectra from only 1/32 of the samples in experimental 2DIR hyperspectral imaging.
Crowding Out The Noise: Algorithmic Collective Action Under Differential Privacy
cs.LG 2025-05 unverdicted novelty 6.0

Differential privacy reduces algorithmic collective action effectiveness, with formal lower bounds on success probability depending on collective size and privacy parameters, plus experimental verification on neural nets.
The Optical and Infrared Are Connected
astro-ph.GA 2025-03 unverdicted novelty 6.0

Neural network predicts WISE IR photometry from SDSS optical spectra with χ²_N ≈1 accuracy and identifies biases in SED-fitting methods for AGN and dust properties.
NaviSplit: Dynamic Multi-Branch Split DNNs for Efficient Distributed Autonomous Navigation
cs.RO 2024-06 unverdicted novelty 6.0

NaviSplit introduces a dynamic multi-branch split DNN framework for UAV navigation that runs perception on-device and control on-edge, achieving 72-81% depth accuracy with 1.2-18 KB transmissions and 95% lower data ra...
Using deep learning to construct stochastic local search SAT solvers with performance bounds
cs.AI 2023-09 unverdicted novelty 6.0

Graph neural networks trained as oracles improve step counts and solved instances for stochastic local search SAT solvers on random and pseudo-industrial benchmarks while preserving theoretical bounds.
Machine Learning Enhanced Laser Spectroscopy for Multi-Species Gas Detection in Complex and Harsh Environments
physics.optics 2026-05 unverdicted novelty 5.0

Machine learning methods including denoising autoencoders, unsupervised interference mitigation, blind source separation, and certifiable classification are developed and experimentally validated to improve multi-spec...
Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence
cs.CV 2026-04 unverdicted novelty 5.0

XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
A Multi-head Attention Fusion Network for Industrial Prognostics under Discrete Operational Conditions
cs.LG 2026-04 unverdicted novelty 5.0

A multi-head attention fusion network integrates monotonic degradation trends, discrete operating state embeddings from clustering, and residual noise using BiLSTM and attention mechanisms to improve prognostic accura...
Defending against Backdoor Attacks via Module Switching
cs.CR 2025-04 unverdicted novelty 5.0

Module-switching defense disrupts backdoors more effectively than weight averaging with fewer models and remains robust even when some models share the same backdoors.
Gamma-Ray Burst Light Curve Reconstruction: A Comparative Machine and Deep Learning Analysis
astro-ph.HE 2024-12 unverdicted novelty 5.0

MLP and Attention U-Net outperform other models in reconstructing GRB light curves on 521 events, cutting plateau parameter uncertainties by 37-41% versus the Willingale baseline while achieving low MSE.
Agentic AI platforms for autonomous training and rule induction of human-human and virus-human protein-protein interactions
cs.AI 2026-04 unverdicted novelty 4.0

Agentic AI platforms autonomously train 87%-accurate PPI prediction models on protein-disjoint data and induce aligning human-readable rules for human-human and virus-human interactions.
Learning to count small and clustered objects with application to bacterial colonies
cs.CV 2026-04 unverdicted novelty 4.0

ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.
An Implementation to Identify the Properties of Multiple Population of Gravitational Wave Sources
gr-qc 2025-09 unverdicted novelty 4.0

GWKokab is a new modular JAX framework that uses normalizing flow samplers for efficient inference on subpopulations of compact binary mergers.
Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
cs.LG 2025-09 unverdicted novelty 4.0

Fine-tuned LLaMA 3.2 VLM outperforms CNN baselines on neutrino event classification while adding interpretability via language reasoning.
Nonlinear Transformations Against Unlearnable Datasets
cs.LG 2024-06 unverdicted novelty 4.0

Nonlinear transformations enable DNNs to achieve substantial test accuracy gains (0.34% to 249.59%) on unlearnable CIFAR10 datasets from twelve protection methods, outperforming a recent linear baseline.
Robustness Analysis of USmorph: II. Optimizing Feature Extraction, Dimensionality Reduction, and Clustering for Unsupervised Galaxy Morphology Classification
astro-ph.GA 2026-05 unverdicted novelty 3.0

Optimizes ImageNet-pretrained AlexNet, UMAP, and a bagging multi-cluster voting scheme with K-means, Birch and Agg for unsupervised galaxy morphology classification, reporting improved stability and consistency with g...

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 31 Pith papers · 2 internal anchors

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray,...

[2] TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org

[3] Abien Fred Agarap. 2017. A Neural Network Architecture Combining Gated Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection in Network Traffic Data. arXiv preprint arXiv:1709.03082 (2017)

[4] Abdulrahman Alalshekmubarak and Leslie S Smith. 2013. A novel approach combining recurrent neural network and support vector machines for time series classification. In Innovations in Information Technology (IIT), 2013 9th International Conference on. IEEE, 42–47

[5] François Chollet et al. 2015. Keras. https://github.com/keras-team/keras. (2015)

[6] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems. 577–585

[7] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, 6789 (2000), 947

[8] J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing In Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55

[9] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105

[11] Yann LeCun, Corinna Cortes, and Christopher JC Burges. 2010. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010)

[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

[13] Yichuan Tang. 2013. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013)

[14] Ludovic Trottier, Philippe Giguère, Brahim Chaib-draa, et al. 2017. Parametric exponential linear unit for deep convolutional neural networks. In Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on. IEEE, 207–214

[15] Stéfan van der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22–30

[16] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745 (2015)

[17] William H Wolberg, W Nick Street, and Olvi L Mangasarian. 1992. Breast cancer Wisconsin (diagnostic) data set. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/] (1992)

[18] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. (2017). arXiv:cs.LG/1708.07747

[19] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J Smola, and Eduard H Hovy. 2016. Hierarchical Attention Networks for Document Classification. In HLT-NAACL. 1480–1489
