Searching for Activation Functions
Mixed citation behavior. Most common role is background (69%).
abstract
The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
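The formula in the abstract, $f(x) = x \cdot \text{sigmoid}(\beta x)$, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `sigmoid`, `swish`, and `relu` are chosen here for clarity, and β is treated as a fixed constant defaulting to 1.0, which is an assumption of this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish as given in the abstract: f(x) = x * sigmoid(beta * x).

    beta is a fixed constant here (an assumption of this sketch).
    """
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified Linear Unit, shown for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Compare the two activations on a few sample inputs.
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

With β = 0 the sigmoid factor is exactly 0.5, so the function reduces to x/2; for large β the sigmoid approaches a step function and the output approaches ReLU, which is one way to read the "similarity to ReLU" mentioned in the abstract.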
citing papers explorer
- Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
- Supervised Guidance Training for Infinite-Dimensional Diffusion Models
Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.
- KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
- Neural Statistical Functions
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
- Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
- The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
- Selectivity and Shape in the Design of Forward-Forward Goodness Functions
Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.
- SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
- Evolving Multi-Channel Confidence-Aware Activation Functions for Missing Data with Channel Propagation
Evolved multi-channel activation functions that incorporate missingness and confidence scores improve classification performance on datasets with missing data.
- Imposing Boundary Conditions on Neural Operators via Learned Function Extensions
A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.
- DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations
DiffeoMorph learns distributed agent protocols to morph into complex 3D shapes from minimal initial conditions via equivariant GNNs and rotation-invariant Zernike loss.
- Kolmogorov-Arnold Chemical Reaction Neural Networks for learning pressure-dependent kinetic rate laws
KA-CRNNs learn pressure-dependent and collider-specific kinetic rate laws from data using Kolmogorov-Arnold activations inside a CRNN framework, outperforming interpolative methods by 2.88x in MSE on two proof-of-concept reactions.
- Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies
Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.
- Accurate and scalable exchange-correlation with deep learning
Skala is a neural XC functional trained on wavefunction data that beats state-of-the-art hybrids on main-group chemistry benchmarks at semi-local computational cost.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- Generative Semantic Communication: Diffusion Models Beyond Bit Recovery
A generative semantic communication system that sends compressed semantic information and uses diffusion models with spatially-adaptive normalizations to reconstruct high-quality, semantically consistent images even under severe channel noise.
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
- On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
- Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
Sparse MoE FFNs redistribute computation from FFN to attention in small Transformers, driven mainly by architectural sparsity rather than learned expert specialization.
- MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.
- What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
- On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
- Competing nonlinearities, criticality, and order-to-chaos transition in deep networks
A statistical mixture of Tanh and Swish activations with critical mixing fraction p_c induces a continuous phase transition to scale-invariant signal propagation in deep networks while preserving smoothness.
- Neural-network reconstruction of THz transmission spectra using electrically tunable AlGaN/GaN plasmonic-crystal analyzer
A feedforward neural network trained on synthetic data inverts voltage-dependent intensities from an electrically tunable AlGaN/GaN plasmonic analyzer to reconstruct THz spectra, achieving lower error than Tikhonov regularization and identifying most resonances correctly.
- Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics
EDL learns a transferable classification loss from unlimited synthetic data via evolutionary optimization and a ranking-consistency objective, serving as a competitive drop-in replacement for cross-entropy on CIFAR-10 with ResNet models.
- AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
- Four-dimensional QCD equation of state from a quasi-parton model with physics-informed neural networks
A PINN-trained quasi-parton model reproduces lattice cumulants at vanishing chemical potentials and supplies a consistent four-dimensional QCD equation of state at finite densities.
- Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
- A Complex-Valued Continuous-Variable Quantum Approximation Optimization Algorithm (CCV-QAOA)
CCV-QAOA is a new complex-valued continuous-variable variant of QAOA that solves real and complex multivariate optimization problems via a variational framework.
- OTProf: estimating high-resolution profiles of optical turbulence ($C_n^2$) from reanalysis using deep learning
Deep learning model OTProf generates high-resolution C_n² profiles from ERA5 reanalysis data and outperforms the Hufnagel-Valley model for vertical structure and integrated parameters like Fried parameter r_0 in the Netherlands.
- LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.
- Neural simulation-based inference of the Higgs trilinear self-coupling via off-shell Higgs production
A hybrid NSBI technique is presented for inferring the Higgs trilinear coupling via off-shell production in SMEFT, achieving near-theoretical-optimum sensitivity with expected HL-LHC constraints.
- OrderFusion: Encoding Orderbook for End-to-End Probabilistic Intraday Electricity Price Forecasting
OrderFusion encodes orderbook buy-sell interactions in an end-to-end probabilistic model for intraday electricity price forecasting with non-crossing quantiles and reports consistent gains over baselines on European CID indices.
- When Attention Sink Emerges in Language Models: An Empirical View
Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.
- The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
- Selective Ambulance Dispatch Under Contextual Travel-Time Uncertainty
IDEAL is a selective dual ambulance dispatch framework that learns context-specific travel times via weakly supervised bilevel networks and models uncertainty with Burg-divergence perturbations to achieve better response-time and resource trade-offs than region-based or map-based baselines.
- A Constant-Time Implementation Methodology for Activation Functions on Microcontrollers
A constant-time implementation methodology for activation functions on ARM Cortex-M4 microcontrollers using branchless selection, Padé approximations, dummy arithmetic, and cycle alignment to eliminate timing side channels while preserving accuracy.
- Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
RBMs using exponential activation functions can represent and learn data structures with strong higher-order interactions better than linear, step or ReLU activations, but only inside an analytically determined parameter window.
- Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
- Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices
A neural network predicts sensitive pseudospectra regions from matrix features to accelerate computation on structured non-normal banded matrices while preserving accuracy in identifying those regions.
- Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.
- Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
- GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories
GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks while running thousands of times faster than optimization solvers.
- Physics-informed neural networks for form-finding of unilateral membrane structures
PINNs with hard and soft boundary enforcement solve membrane form-finding PDEs to accuracy comparable with FEM, with hard-BC yielding smaller boundary errors.
- ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.
- Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis
A three-stage self-supervised pipeline for data-efficient frame-level syllable detection in complex birdsong using a Residual MLP-RNN model.
- Activation Function Design Sustains Plasticity in Continual Learning
Smooth-Leaky and Randomized Smooth-Leaky activations mitigate loss of plasticity in continual learning by targeting negative-branch shape and saturation behavior.