pith · machine review for the scientific record

arxiv: 2604.21677 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.AI · cs.NE

Recognition: unknown

Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE
keywords activation functions · GEM · GELU · smooth activations · rational arithmetic · CNN-transformer tradeoff · image classification · language modeling

The pith

GEM family of C^{2N}-smooth rational activations achieves ReLU-like performance with log-logistic gates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GEM, a family of activation functions that are C^{2N}-smooth, rely exclusively on rational arithmetic, and use a log-logistic cumulative distribution function as their gate. This construction aims to provide smooth gradients for optimization while matching or exceeding the performance of common choices like ReLU and GELU across image and language models. Variants include E-GEM, which uses an epsilon parameter to approximate ReLU to arbitrary L^p precision, and SE-GEM, a piecewise version that avoids dead neurons while preserving smoothness at junctions. An ablation study identifies N=1 as best for standard CNN depths, cutting the GELU accuracy gap on CIFAR-100 with ResNet-56 from 6.10 percent to 2.12 percent, and shows a tradeoff where N=2 works better for transformers. Results on MNIST, CIFAR-10, GPT-2, and BERT-small further demonstrate competitive or superior metrics depending on the variant and hyperparameters chosen.
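
The review does not reproduce the paper's formulas, so the sketch below uses an assumed gate, a log-logistic CDF with shape 2N and unit scale, F(x) = x^{2N} / (1 + x^{2N}) for x > 0 and 0 otherwise, to illustrate how a purely rational, C^{2N} junction-smooth, ReLU-like activation of this kind can be built. It is an illustration of the construction, not the authors' definition.

```python
# Illustrative sketch only: the gate below is an ASSUMED log-logistic CDF with
# shape 2N and unit scale; the authors' exact GEM formula may differ.
import numpy as np

def log_logistic_gate(x, n=1):
    """Rational gate: no exp, erf, or tanh, only +, *, /."""
    x = np.asarray(x, dtype=float)
    pos = np.maximum(x, 0.0)
    p = pos ** (2 * n)
    return p / (1.0 + p)          # 0 for x <= 0, tends to 1 as x grows

def gem_like(x, n=1):
    """x times the gate: roughly x^(2N+1) near 0+, roughly x for large x.

    Under this assumed form the first 2N derivatives vanish at the origin
    from both sides, so the junction with the zero branch is C^(2N); the gap
    to ReLU on x > 0 is x / (1 + x^(2N)), which shrinks faster for larger N,
    consistent with the convergence behaviour described in the Figure 1 caption.
    """
    return np.asarray(x, dtype=float) * log_logistic_gate(x, n=n)

print(gem_like(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]), n=1))
```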

Core claim

The paper claims that GEM, a new family of activation functions whose gate follows a log-logistic CDF, can be made C^{2N} differentiable using only rational operations while delivering performance on par with or better than GELU: N=1 is optimal for deep CNNs, N=2 is optimal for transformers, and epsilon-tuned E-GEM variants close or reverse the GELU deficit on multiple benchmarks.

What carries the argument

The log-logistic cumulative distribution function serving as the smooth rational gate in the monomial-based activation, with integer N controlling the order of differentiability and epsilon controlling the ReLU approximation tightness.
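
The exact E-GEM expression is not quoted in this review; the sketch below assumes an epsilon-scaled version of the same rational gate used in the sketch above and checks numerically that the L^p distance to ReLU on a finite grid shrinks as epsilon decreases, which is the approximation property the paper advertises.

```python
# Sketch of the epsilon idea under an ASSUMED E-GEM-style gate
# g_eps(x) = x^(2N) / (eps + x^(2N)) for x > 0 (not the authors' exact form):
# as eps shrinks, the gated activation approaches ReLU in L^p on a finite grid.
import numpy as np

def egem_like(x, n=1, eps=1e-4):
    x = np.asarray(x, dtype=float)
    pos = np.maximum(x, 0.0)
    p = pos ** (2 * n)
    return x * (p / (eps + p))    # eps > 0 keeps the denominator nonzero

def lp_gap_to_relu(eps, n=1, p=2.0):
    grid = np.linspace(-5.0, 5.0, 20001)
    dx = grid[1] - grid[0]
    diff = np.abs(egem_like(grid, n=n, eps=eps) - np.maximum(grid, 0.0))
    return (np.sum(diff ** p) * dx) ** (1.0 / p)

for eps in (1.0, 1e-2, 1e-4, 1e-6):
    print(f"eps={eps:g}  L2 gap ~ {lp_gap_to_relu(eps):.4g}")  # gap should shrink
```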

If this is right

  • N=1 reduces the GELU deficit on CIFAR-100 with ResNet-56 from 6.10% to 2.12%.
  • N=1 is preferred for deep CNNs while N=2 is preferred for transformers.
  • GEM N=1 and N=2 both beat GELU perplexity on GPT-2 (73.32 and 72.57 versus 73.76).
  • SE-GEM with epsilon=10^{-4} exceeds GELU accuracy on CIFAR-10 ResNet-56 (92.51% versus 92.44%).
  • E-GEM with small epsilon narrows the GELU deficit on CIFAR-100 ResNet-56 to 0.62%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The purely rational arithmetic could allow faster evaluation on hardware that lacks fast transcendental function support.
  • The CNN-transformer smoothness tradeoff implies that activation design should be architecture-specific rather than universal.
  • Extending epsilon tuning to even larger models might reveal further scale-dependent optima.
  • Applying the same family to other modalities such as audio or graph networks could test whether the N tradeoff generalizes.

Load-bearing premise

That observed performance differences between activations result from the functions themselves rather than from unequal hyperparameter searches, random seeds, or training schedules.

What would settle it

Re-training all compared activations on the same datasets and architectures while using identical hyperparameter grids, random initializations, and training schedules to check whether the reported accuracy and perplexity gaps persist.
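
What that protocol could look like is sketched below with a toy two-layer model and synthetic data standing in for the paper's benchmarks (both are placeholders, not the authors' setup); every activation sees identical data, initialization seed, optimizer, and schedule, so only the activation varies.

```python
# Minimal sketch of a controlled activation comparison: shared data, seeds,
# optimizer, and schedule, with only the activation swapped.
import torch
from torch import nn

def run(activation_cls, seed):
    torch.manual_seed(seed)                      # identical seed for every activation
    x = torch.randn(256, 20)                     # placeholder synthetic data
    y = (x.sum(dim=1) > 0).float().unsqueeze(1)
    model = nn.Sequential(nn.Linear(20, 64), activation_cls(), nn.Linear(64, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)   # shared hyperparameters
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(200):                         # shared training schedule
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# GEM, E-GEM, and SE-GEM would be added here as extra nn.Module activations.
for name, act in {"relu": nn.ReLU, "gelu": nn.GELU}.items():
    runs = [run(act, seed) for seed in (0, 1, 2)]
    print(name, [round(r, 4) for r in runs])
```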

Figures

Figures reproduced from arXiv: 2604.21677 by Eylon E. Krause.

Figure 1
Figure 1: The family of GEM activation functions. Like ReLU and other max(0, ·)-type activations, GEM is bounded below and unbounded above, since it is asymptotically equal to ReLU for large x. [equation garbled in extraction] GEM_N converges to the identity at a rate controlled by N; larger N yields faster convergence to ReLU. The powers…
Figure 3
Figure 3: E-GEM and its derivative with varying ε, superimposed alongside ReLU.
Figure 4
Figure 4: SE-GEM and its derivative with varying ε, superimposed alongside ReLU and Swish.
Original abstract

The choice of activation function plays a crucial role in the optimization and performance of deep neural networks. While the Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and effectiveness, its lack of smoothness may hinder gradient-based optimization in deep architectures. In this work we propose a family of $C^{2N}$-smooth activation functions whose gate follows a log-logistic CDF, achieving ReLU-like performance with purely rational arithmetic. We introduce three variants: GEM (the base family), E-GEM (an $\epsilon$-parameterized generalization enabling arbitrary $L^p$-approximation of ReLU), and SE-GEM (a piecewise variant eliminating dead neurons with $C^{2N}$ junction smoothness). An $N$-ablation study establishes $N=1$ as optimal for standard-depth networks, reducing the GELU deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The smoothness parameter $N$ further reveals a CNN-transformer tradeoff: $N=1$ is preferred for deep CNNs, while $N=2$ is preferred for transformers. On MNIST, E-GEM ties the best baseline (99.23%). On CIFAR-10 + ResNet-56, SE-GEM ($\epsilon=10^{-4}$) surpasses GELU (92.51% vs 92.44%) -- the first GEM-family activation to outperform GELU. On CIFAR-100 + ResNet-56, E-GEM reduces the GELU deficit from 6.10% (GEM $N=2$) to just 0.62%. On GPT-2 (124M), GEM achieves the lowest perplexity (72.57 vs 73.76 for GELU), with GEM $N=1$ also beating GELU (73.32). On BERT-small, E-GEM ($\epsilon=10$) achieves the best validation loss (6.656) across all activations. The $\epsilon$-parameterization reveals a scale-dependent optimum: small $\epsilon$ ($10^{-4}$--$10^{-6}$) for deep CNNs and larger transformers, with the special case of small transformers (BERT-small) benefiting from large $\epsilon$ ($\epsilon=10$) due to its limited depth and unconstrained gradients.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the GEM family of C^{2N}-smooth rational activation functions whose gating mechanism is based on the log-logistic CDF. It introduces base GEM, an epsilon-parameterized E-GEM for L^p approximation to ReLU, and piecewise SE-GEM for eliminating dead neurons while preserving smoothness. Through N-ablation and benchmark experiments, it claims N=1 is optimal for standard-depth CNNs (reducing the GELU accuracy deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%), N=2 is preferred for transformers, and specific E-GEM/SE-GEM variants match or exceed GELU on MNIST (99.23%), CIFAR-10 (92.51%), CIFAR-100 (near-GELU), GPT-2 (72.57 perplexity), and BERT-small (6.656 validation loss).

Significance. If the performance differences are robustly attributable to the activation properties, the GEM family provides a new class of high-order differentiable, purely rational activations that could improve gradient flow in deep networks while enabling exact arithmetic implementations. The reported CNN-transformer tradeoff with respect to smoothness parameter N and the scale-dependent epsilon optimum are potentially actionable findings for architecture-specific activation selection.

major comments (2)
  1. [Experimental results and ablation studies] The central empirical claims (e.g., GEM N=1 reducing the GELU deficit from 6.10% to 2.12% on CIFAR-100 + ResNet-56, GEM achieving 72.57 perplexity on GPT-2 vs. 73.76 for GELU, and the CNN vs. transformer N tradeoff) rest on head-to-head comparisons whose validity requires identical hyperparameter grids, random seeds, initialization distributions, learning-rate schedules, and data-augmentation pipelines across all activations. The manuscript provides no explicit statement or table confirming this protocol equivalence; without it the attribution of gains to C^{2N} smoothness or rational form is not established.
  2. [Results on image classification and language models] No error bars, standard deviations, or number of independent runs are reported for the key metrics (accuracy on CIFAR, perplexity on GPT-2, validation loss on BERT). Single-run point estimates are insufficient to support claims of superiority or optimality of particular N or epsilon values, especially given the stochasticity of deep-network training.
minor comments (2)
  1. [Abstract and CIFAR-10 results] The abstract states that SE-GEM with epsilon=10^{-4} 'surpasses GELU (92.51% vs 92.44%)' on CIFAR-10 + ResNet-56; a table or section should clarify whether this difference is within the variability of the training process.
  2. [Definition of the GEM family] Notation for the log-logistic CDF and the precise rational expressions for each variant (GEM, E-GEM, SE-GEM) should be given explicitly with the corresponding equations in the methods section to allow direct verification of the 'purely rational arithmetic' claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our experimental methodology. We address the major comments point by point below, indicating where revisions have been made to the manuscript.

point-by-point responses
  1. Referee: [Experimental results and ablation studies] The central empirical claims (e.g., GEM N=1 reducing the GELU deficit from 6.10% to 2.12% on CIFAR-100 + ResNet-56, GEM achieving 72.57 perplexity on GPT-2 vs. 73.76 for GELU, and the CNN vs. transformer N tradeoff) rest on head-to-head comparisons whose validity requires identical hyperparameter grids, random seeds, initialization distributions, learning-rate schedules, and data-augmentation pipelines across all activations. The manuscript provides no explicit statement or table confirming this protocol equivalence; without it the attribution of gains to C^{2N} smoothness or rational form is not established.

    Authors: We confirm that all comparisons were performed under identical experimental conditions, with the only difference being the activation function. The training code, hyperparameters, seeds, and pipelines were shared across all tested activations. To address this, we have added an explicit description of the experimental protocol in the revised manuscript, including a table that lists the common hyperparameters and settings used for each benchmark. This makes the equivalence clear and supports attributing the performance differences to the properties of the GEM family. revision: yes

  2. Referee: [Results on image classification and language models] No error bars, standard deviations, or number of independent runs are reported for the key metrics (accuracy on CIFAR, perplexity on GPT-2, validation loss on BERT). Single-run point estimates are insufficient to support claims of superiority or optimality of particular N or epsilon values, especially given the stochasticity of deep-network training.

    Authors: We agree that reporting variability across multiple runs would provide stronger evidence. The results in the manuscript are based on single training runs for each configuration, as is typical in many similar studies. We have updated the manuscript to explicitly state that the reported metrics are from single runs and to include a discussion of this limitation in the experimental section. However, we are unable to provide error bars without conducting additional experiments. revision: partial

standing simulated objections (not resolved)
  • Reporting error bars, standard deviations, and results from multiple independent runs for the key performance metrics.
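
As a hedged sketch of the missing analysis, the snippet below shows how per-seed results could be summarized with a mean, a standard deviation, and a Welch t-test as one possible yardstick for whether a small accuracy gap exceeds run-to-run noise; the accuracy arrays are invented placeholders, not numbers from the paper.

```python
# Placeholder numbers only: these are NOT results from the paper, just an
# illustration of reporting mean, spread, and a simple significance check
# across independent seeds.
import numpy as np
from scipy import stats

acc_se_gem = np.array([92.5, 92.3, 92.6, 92.4, 92.5])   # hypothetical per-seed %
acc_gelu   = np.array([92.4, 92.5, 92.3, 92.5, 92.4])   # hypothetical per-seed %

for name, runs in (("SE-GEM", acc_se_gem), ("GELU", acc_gelu)):
    print(f"{name}: {runs.mean():.2f} +/- {runs.std(ddof=1):.2f} (n={runs.size})")

t_stat, p_val = stats.ttest_ind(acc_se_gem, acc_gelu, equal_var=False)
print(f"Welch t-test p-value: {p_val:.3f}")   # large p: gap within run noise
```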

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the new functional forms, empirical tuning of N and epsilon on benchmark data, and the domain assumption that the log-logistic CDF yields useful neural-network gates. No machine-checked proofs or external benchmarks are mentioned.

free parameters (2)
  • N
    Smoothness order selected via ablation on ResNet and transformer models.
  • epsilon
    Controls closeness to ReLU in E-GEM; values like 10^{-4} and 10 chosen per architecture.
axioms (1)
  • domain assumption The log-logistic CDF provides a suitable smooth gate for activation functions in deep networks.
    Invoked to define the base GEM family and its differentiability properties.
invented entities (1)
  • GEM family of activation functions (no independent evidence)
    purpose: Rational C^{2N}-smooth replacement for ReLU/GELU
    Newly proposed functional form and variants.

pith-pipeline@v0.9.0 · 5739 in / 1515 out tokens · 45228 ms · 2026-05-09T22:30:07.537694+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 9 canonical work pages · 4 internal anchors

  1. Rubio, R. (2025). Mathematics of Machine Learning and Machine Learning for Mathematics [Lecture Notes]. https://app.perusall.com/courses/mathematics-of-machine-learning-and-machine-learning-for-mathematics/
  2. Agarap, A. F. (2019). Deep Learning using Rectified Linear Units (ReLU). http://arxiv.org/abs/1803.08375
  3. Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). http://arxiv.org/abs/1511.07289
  4. Hendrycks, D., & Gimpel, K. (2023). Gaussian Error Linear Units (GELUs). http://arxiv.org/abs/1606.08415
  5. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: a Self-Gated Activation Function. http://arxiv.org/abs/1710.05941
  6. Misra, D. (2020). Mish: A Self Regularized Non-Monotonic Activation Function. http://arxiv.org/abs/1908.08681
  7. Shazeer, N. (2020). GLU Variants Improve Transformer. http://arxiv.org/abs/2002.05202
  8. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. http://arxiv.org/abs/1512.03385
  9. Biswas, K., Kumar, S., Banerjee, S., & Pandey, A. K. (2022). SAU: Smooth Activation Function Using Convolution with Approximate Identities. In ECCV 2022.
  10. Biswas, K., Kumar, S., Banerjee, S., & Pandey, A. K. (2022). SMU: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique. http://arxiv.org/abs/2111.04682
  11. Chen, J., Bhatt, R., & Bhatt, A. (2023). Saturated Non-Monotonic Activation Functions. http://arxiv.org/abs/2305.07537