pith · machine review for the scientific record

arxiv: 2604.21677 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.AI · cs.NE

Recognition: unknown

Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE
keywords activation functions · GEM · GELU · smooth activations · rational arithmetic · CNN-transformer tradeoff · image classification · language modeling

The pith

GEM family of C^{2N}-smooth rational activations achieves ReLU-like performance with log-logistic gates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GEM, a family of activation functions that are C^{2N}-smooth, rely exclusively on rational arithmetic, and use a log-logistic cumulative distribution function as their gate. This construction aims to provide smooth gradients for optimization while matching or exceeding the performance of common choices like ReLU and GELU across image and language models. Variants include E-GEM, which uses an epsilon parameter to approximate ReLU to arbitrary L^p precision, and SE-GEM, a piecewise version that avoids dead neurons while preserving smoothness at junctions. An ablation study identifies N=1 as best for standard CNN depths, cutting the GELU accuracy gap on CIFAR-100 with ResNet-56 from 6.10 percent to 2.12 percent, and shows a tradeoff where N=2 works better for transformers. Results on MNIST, CIFAR-10, GPT-2, and BERT-small further demonstrate competitive or superior metrics depending on the variant and hyperparameters chosen.
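
The review does not reproduce the paper's formulas, so the sketch below uses an assumed gate, a log-logistic CDF with shape 2N and unit scale, F(x) = x^{2N} / (1 + x^{2N}) for x > 0 and 0 otherwise, to illustrate how a purely rational, C^{2N} junction-smooth, ReLU-like activation of this kind can be built. It is an illustration of the construction, not the authors' definition.

```python
# Illustrative sketch only: the gate below is an ASSUMED log-logistic CDF with
# shape 2N and unit scale; the authors' exact GEM formula may differ.
import numpy as np

def log_logistic_gate(x, n=1):
    """Rational gate: no exp, erf, or tanh, only +, *, /."""
    x = np.asarray(x, dtype=float)
    pos = np.maximum(x, 0.0)
    p = pos ** (2 * n)
    return p / (1.0 + p)          # 0 for x <= 0, tends to 1 as x grows

def gem_like(x, n=1):
    """x times the gate: roughly x^(2N+1) near 0+, roughly x for large x.

    Under this assumed form the first 2N derivatives vanish at the origin
    from both sides, so the junction with the zero branch is C^(2N); the gap
    to ReLU on x > 0 is x / (1 + x^(2N)), which shrinks faster for larger N,
    consistent with the convergence behaviour described in the Figure 1 caption.
    """
    return np.asarray(x, dtype=float) * log_logistic_gate(x, n=n)

print(gem_like(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]), n=1))
```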

Core claim

The paper claims that GEM, a new family of activation functions whose gate follows a log-logistic CDF, can be made C^{2N} differentiable using only rational operations while delivering performance on par with or better than GELU: N=1 is optimal for deep CNNs, N=2 is optimal for transformers, and epsilon-tuned E-GEM variants close or reverse the GELU deficit on multiple benchmarks.

What carries the argument

The log-logistic cumulative distribution function serving as the smooth rational gate in the monomial-based activation, with integer N controlling the order of differentiability and epsilon controlling the ReLU approximation tightness.
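
The exact E-GEM expression is not quoted in this review; the sketch below assumes an epsilon-scaled version of the same rational gate used in the sketch above and checks numerically that the L^p distance to ReLU on a finite grid shrinks as epsilon decreases, which is the approximation property the paper advertises.

```python
# Sketch of the epsilon idea under an ASSUMED E-GEM-style gate
# g_eps(x) = x^(2N) / (eps + x^(2N)) for x > 0 (not the authors' exact form):
# as eps shrinks, the gated activation approaches ReLU in L^p on a finite grid.
import numpy as np

def egem_like(x, n=1, eps=1e-4):
    x = np.asarray(x, dtype=float)
    pos = np.maximum(x, 0.0)
    p = pos ** (2 * n)
    return x * (p / (eps + p))    # eps > 0 keeps the denominator nonzero

def lp_gap_to_relu(eps, n=1, p=2.0):
    grid = np.linspace(-5.0, 5.0, 20001)
    dx = grid[1] - grid[0]
    diff = np.abs(egem_like(grid, n=n, eps=eps) - np.maximum(grid, 0.0))
    return (np.sum(diff ** p) * dx) ** (1.0 / p)

for eps in (1.0, 1e-2, 1e-4, 1e-6):
    print(f"eps={eps:g}  L2 gap ~ {lp_gap_to_relu(eps):.4g}")  # gap should shrink
```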

If this is right

  • N=1 reduces the GELU deficit on CIFAR-100 with ResNet-56 from 6.10% to 2.12%.
  • N=1 is preferred for deep CNNs while N=2 is preferred for transformers.
  • GEM N=1 and N=2 both beat GELU perplexity on GPT-2 (73.32 and 72.57 versus 73.76).
  • SE-GEM with epsilon=10^{-4} exceeds GELU accuracy on CIFAR-10 ResNet-56 (92.51% versus 92.44%).
  • E-GEM with small epsilon narrows the GELU deficit on CIFAR-100 ResNet-56 to 0.62%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The purely rational arithmetic could allow faster evaluation on hardware that lacks fast transcendental function support.
  • The CNN-transformer smoothness tradeoff implies that activation design should be architecture-specific rather than universal.
  • Extending epsilon tuning to even larger models might reveal further scale-dependent optima.
  • Applying the same family to other modalities such as audio or graph networks could test whether the N tradeoff generalizes.

Load-bearing premise

That observed performance differences between activations result from the functions themselves rather than from unequal hyperparameter searches, random seeds, or training schedules.

What would settle it

Re-training all compared activations on the same datasets and architectures while using identical hyperparameter grids, random initializations, and training schedules to check whether the reported accuracy and perplexity gaps persist.
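
What that protocol could look like is sketched below with a toy two-layer model and synthetic data standing in for the paper's benchmarks (both are placeholders, not the authors' setup); every activation sees identical data, initialization seed, optimizer, and schedule, so only the activation varies.

```python
# Minimal sketch of a controlled activation comparison: shared data, seeds,
# optimizer, and schedule, with only the activation swapped.
import torch
from torch import nn

def run(activation_cls, seed):
    torch.manual_seed(seed)                      # identical seed for every activation
    x = torch.randn(256, 20)                     # placeholder synthetic data
    y = (x.sum(dim=1) > 0).float().unsqueeze(1)
    model = nn.Sequential(nn.Linear(20, 64), activation_cls(), nn.Linear(64, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)   # shared hyperparameters
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(200):                         # shared training schedule
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# GEM, E-GEM, and SE-GEM would be added here as extra nn.Module activations.
for name, act in {"relu": nn.ReLU, "gelu": nn.GELU}.items():
    runs = [run(act, seed) for seed in (0, 1, 2)]
    print(name, [round(r, 4) for r in runs])
```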

Figures

Figures reproduced from arXiv: 2604.21677 by Eylon E. Krause.

Figure 1
Figure 1: The family of GEM activation functions. Like ReLU and other max(0, ·)-type activations, GEM is bounded below and unbounded above, since it is asymptotically equal to ReLU for large x. [equation garbled in extraction] GEM_N converges to the identity at a rate controlled by N; larger N yields faster convergence to ReLU. The powers…
Figure 3
Figure 3: E-GEM and its derivative with varying ε, superimposed alongside ReLU.
Figure 4
Figure 4: SE-GEM and its derivative with varying ε, superimposed alongside ReLU and Swish.
Original abstract

The choice of activation function plays a crucial role in the optimization and performance of deep neural networks. While the Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and effectiveness, its lack of smoothness may hinder gradient-based optimization in deep architectures. In this work we propose a family of $C^{2N}$-smooth activation functions whose gate follows a log-logistic CDF, achieving ReLU-like performance with purely rational arithmetic. We introduce three variants: GEM (the base family), E-GEM (an $\epsilon$-parameterized generalization enabling arbitrary $L^p$-approximation of ReLU), and SE-GEM (a piecewise variant eliminating dead neurons with $C^{2N}$ junction smoothness). An $N$-ablation study establishes $N=1$ as optimal for standard-depth networks, reducing the GELU deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The smoothness parameter $N$ further reveals a CNN-transformer tradeoff: $N=1$ is preferred for deep CNNs, while $N=2$ is preferred for transformers. On MNIST, E-GEM ties the best baseline (99.23%). On CIFAR-10 + ResNet-56, SE-GEM ($\epsilon=10^{-4}$) surpasses GELU (92.51% vs 92.44%) -- the first GEM-family activation to outperform GELU. On CIFAR-100 + ResNet-56, E-GEM reduces the GELU deficit from 6.10% (GEM $N=2$) to just 0.62%. On GPT-2 (124M), GEM achieves the lowest perplexity (72.57 vs 73.76 for GELU), with GEM $N=1$ also beating GELU (73.32). On BERT-small, E-GEM ($\epsilon=10$) achieves the best validation loss (6.656) across all activations. The $\epsilon$-parameterization reveals a scale-dependent optimum: small $\epsilon$ ($10^{-4}$--$10^{-6}$) for deep CNNs and larger transformers, with the special case of small transformers (BERT-small) benefiting from large $\epsilon$ ($\epsilon=10$) due to its limited depth and unconstrained gradients.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the GEM family of C^{2N}-smooth rational activation functions whose gating mechanism is based on the log-logistic CDF. It introduces base GEM, an epsilon-parameterized E-GEM for L^p approximation to ReLU, and piecewise SE-GEM for eliminating dead neurons while preserving smoothness. Through N-ablation and benchmark experiments, it claims N=1 is optimal for standard-depth CNNs (reducing the GELU accuracy deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%), N=2 is preferred for transformers, and specific E-GEM/SE-GEM variants match or exceed GELU on MNIST (99.23%), CIFAR-10 (92.51%), CIFAR-100 (near-GELU), GPT-2 (72.57 perplexity), and BERT-small (6.656 validation loss).

Significance. If the performance differences are robustly attributable to the activation properties, the GEM family provides a new class of high-order differentiable, purely rational activations that could improve gradient flow in deep networks while enabling exact arithmetic implementations. The reported CNN-transformer tradeoff with respect to smoothness parameter N and the scale-dependent epsilon optimum are potentially actionable findings for architecture-specific activation selection.

major comments (2)
  1. [Experimental results and ablation studies] The central empirical claims (e.g., GEM N=1 reducing the GELU deficit from 6.10% to 2.12% on CIFAR-100 + ResNet-56, GEM achieving 72.57 perplexity on GPT-2 vs. 73.76 for GELU, and the CNN vs. transformer N tradeoff) rest on head-to-head comparisons whose validity requires identical hyperparameter grids, random seeds, initialization distributions, learning-rate schedules, and data-augmentation pipelines across all activations. The manuscript provides no explicit statement or table confirming this protocol equivalence; without it the attribution of gains to C^{2N} smoothness or rational form is not established.
  2. [Results on image classification and language models] No error bars, standard deviations, or number of independent runs are reported for the key metrics (accuracy on CIFAR, perplexity on GPT-2, validation loss on BERT). Single-run point estimates are insufficient to support claims of superiority or optimality of particular N or epsilon values, especially given the stochasticity of deep-network training.
minor comments (2)
  1. [Abstract and CIFAR-10 results] The abstract states that SE-GEM with epsilon=10^{-4} 'surpasses GELU (92.51% vs 92.44%)' on CIFAR-10 + ResNet-56; a table or section should clarify whether this difference is within the variability of the training process.
  2. [Definition of the GEM family] Notation for the log-logistic CDF and the precise rational expressions for each variant (GEM, E-GEM, SE-GEM) should be given explicitly with the corresponding equations in the methods section to allow direct verification of the 'purely rational arithmetic' claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our experimental methodology. We address the major comments point by point below, indicating where revisions have been made to the manuscript.

point-by-point responses
  1. Referee: [Experimental results and ablation studies] The central empirical claims (e.g., GEM N=1 reducing the GELU deficit from 6.10% to 2.12% on CIFAR-100 + ResNet-56, GEM achieving 72.57 perplexity on GPT-2 vs. 73.76 for GELU, and the CNN vs. transformer N tradeoff) rest on head-to-head comparisons whose validity requires identical hyperparameter grids, random seeds, initialization distributions, learning-rate schedules, and data-augmentation pipelines across all activations. The manuscript provides no explicit statement or table confirming this protocol equivalence; without it the attribution of gains to C^{2N} smoothness or rational form is not established.

    Authors: We confirm that all comparisons were performed under identical experimental conditions, with the only difference being the activation function. The training code, hyperparameters, seeds, and pipelines were shared across all tested activations. To address this, we have added an explicit description of the experimental protocol in the revised manuscript, including a table that lists the common hyperparameters and settings used for each benchmark. This makes the equivalence clear and supports attributing the performance differences to the properties of the GEM family. revision: yes

  2. Referee: [Results on image classification and language models] No error bars, standard deviations, or number of independent runs are reported for the key metrics (accuracy on CIFAR, perplexity on GPT-2, validation loss on BERT). Single-run point estimates are insufficient to support claims of superiority or optimality of particular N or epsilon values, especially given the stochasticity of deep-network training.

    Authors: We agree that reporting variability across multiple runs would provide stronger evidence. The results in the manuscript are based on single training runs for each configuration, as is typical in many similar studies. We have updated the manuscript to explicitly state that the reported metrics are from single runs and to include a discussion of this limitation in the experimental section. However, we are unable to provide error bars without conducting additional experiments. revision: partial

standing simulated objections (not resolved)
  • Reporting error bars, standard deviations, and results from multiple independent runs for the key performance metrics.
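
As a hedged sketch of the missing analysis, the snippet below shows how per-seed results could be summarized with a mean, a standard deviation, and a Welch t-test as one possible yardstick for whether a small accuracy gap exceeds run-to-run noise; the accuracy arrays are invented placeholders, not numbers from the paper.

```python
# Placeholder numbers only: these are NOT results from the paper, just an
# illustration of reporting mean, spread, and a simple significance check
# across independent seeds.
import numpy as np
from scipy import stats

acc_se_gem = np.array([92.5, 92.3, 92.6, 92.4, 92.5])   # hypothetical per-seed %
acc_gelu   = np.array([92.4, 92.5, 92.3, 92.5, 92.4])   # hypothetical per-seed %

for name, runs in (("SE-GEM", acc_se_gem), ("GELU", acc_gelu)):
    print(f"{name}: {runs.mean():.2f} +/- {runs.std(ddof=1):.2f} (n={runs.size})")

t_stat, p_val = stats.ttest_ind(acc_se_gem, acc_gelu, equal_var=False)
print(f"Welch t-test p-value: {p_val:.3f}")   # large p: gap within run noise
```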

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the new functional forms, empirical tuning of N and epsilon on benchmark data, and the domain assumption that the log-logistic CDF yields useful neural-network gates. No machine-checked proofs or external benchmarks are mentioned.

free parameters (2)
  • N
    Smoothness order selected via ablation on ResNet and transformer models.
  • epsilon
    Controls closeness to ReLU in E-GEM; values like 10^{-4} and 10 chosen per architecture.
axioms (1)
  • domain assumption The log-logistic CDF provides a suitable smooth gate for activation functions in deep networks.
    Invoked to define the base GEM family and its differentiability properties.
invented entities (1)
  • GEM family of activation functions (no independent evidence)
    purpose: Rational C^{2N}-smooth replacement for ReLU/GELU
    Newly proposed functional form and variants.

pith-pipeline@v0.9.0 · 5739 in / 1515 out tokens · 45228 ms · 2026-05-09T22:30:07.537694+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 9 canonical work pages · 4 internal anchors

  1. Rubio, R. (2025). Mathematics of Machine Learning and Machine Learning for Mathematics [Lecture Notes]. https://app.perusall.com/courses/mathematics-of-machine-learning-and-machine-learning-for-mathematics/
  2. Agarap, A. F. (2019). Deep Learning using Rectified Linear Units (ReLU). http://arxiv.org/abs/1803.08375
  3. Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). http://arxiv.org/abs/1511.07289
  4. Hendrycks, D., & Gimpel, K. (2023). Gaussian Error Linear Units (GELUs). http://arxiv.org/abs/1606.08415
  5. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: a Self-Gated Activation Function. http://arxiv.org/abs/1710.05941
  6. Misra, D. (2020). Mish: A Self Regularized Non-Monotonic Activation Function. http://arxiv.org/abs/1908.08681
  7. Shazeer, N. (2020). GLU Variants Improve Transformer. http://arxiv.org/abs/2002.05202
  8. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. http://arxiv.org/abs/1512.03385
  9. Biswas, K., Kumar, S., Banerjee, S., & Pandey, A. K. (2022). SAU: Smooth Activation Function Using Convolution with Approximate Identities. In ECCV 2022.
  10. Biswas, K., Kumar, S., Banerjee, S., & Pandey, A. K. (2022). SMU: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique. http://arxiv.org/abs/2111.04682
  11. Chen, J., Bhatt, R., & Bhatt, A. (2023). Saturated Non-Monotonic Activation Functions. http://arxiv.org/abs/2305.07537