Beyond ReLU: How Activations Affect Neural Kernels and Random Wide Networks

David Holzmüller; Max Schölpple

arxiv: 2506.22429 · v2 · submitted 2025-06-27 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-19 07:38 UTC · model grok-4.3

keywords: neural tangent kernel · NNGP · RKHS · activation functions · infinite-width networks · non-smooth activations · random wide networks

The pith

Activations that are non-smooth only at zero generate equivalent RKHSs for neural kernels at every depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper characterizes the reproducing kernel Hilbert spaces of neural tangent kernels and neural network Gaussian process kernels for activation functions that have their only non-smoothness at zero. This extends prior results on powers of the ReLU to activations such as SELU, ELU, and LeakyReLU. For this broad class the resulting RKHS is equivalent at different network depths, with the space fixed by the degree of non-smoothness alone. Polynomial activations instead produce RKHS that vary with depth. The work further derives the smoothness of sample paths from the NNGP, describing the regularity of infinitely wide networks at initialization.

Core claim

For activations whose only non-smoothness is at zero, the RKHSs of the NTK and NNGP kernels at different network depths are equivalent, and the equivalence class is determined by the degree of non-smoothness.

What carries the argument

RKHS characterization extending from ReLU powers to activations non-smooth solely at zero under the infinite-width limit.

If this is right

Equivalent RKHS across depths holds for SELU, ELU, LeakyReLU and similar activations.
Polynomial activations produce depth-dependent RKHS.
Smoothness of NNGP sample paths is determined by the activation.
Special cases such as missing biases or two-layer networks follow the same pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Choice of these activations may not change the effective function space when depth varies.
Results could guide analysis of finite-width corrections or other kernel regimes.
Connections appear to smoothness properties in random feature models or approximation theory.

Load-bearing premise

Activation functions have their only non-smoothness at zero.
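To make the premise concrete, the sketch below tabulates the one-sided derivatives at zero for a few of the activations named above and reports the lowest derivative order at which the two sides disagree. It assumes that the "degree of non-smoothness" means this lowest order, which is one reading of the smoothness definition quoted in the Theorem 9 excerpt further down, not a statement of the paper's exact convention. The helper names (`linear_branch`, `exp_branch`, `kink_order`) are hypothetical, and the derivative formulas are written out by hand rather than taken from the paper.

```python
# One-sided derivatives at zero, written out by hand for each activation.
# Assumption: the "degree of non-smoothness" is the smallest m for which the
# left and right limits of the m-th derivative at zero differ.

SELU_ALPHA = 1.6732632423543772  # standard SELU constants
SELU_LAMBDA = 1.0507009873554805


def linear_branch(slope):
    """m-th derivative at 0 of t -> slope * t."""
    return lambda m: 0.0 if m == 0 else (slope if m == 1 else 0.0)


def exp_branch(scale):
    """m-th derivative at 0 of t -> scale * (exp(t) - 1)."""
    return lambda m: 0.0 if m == 0 else scale


# name -> (derivatives from the left, derivatives from the right)
ACTIVATIONS = {
    "ReLU": (linear_branch(0.0), linear_branch(1.0)),
    "LeakyReLU(0.01)": (linear_branch(0.01), linear_branch(1.0)),
    "ELU(alpha=1)": (exp_branch(1.0), linear_branch(1.0)),
    "SELU": (exp_branch(SELU_LAMBDA * SELU_ALPHA), linear_branch(SELU_LAMBDA)),
}


def kink_order(left, right, max_order=10, tol=1e-12):
    """Smallest m whose one-sided m-th derivatives at zero disagree, else None."""
    for m in range(max_order + 1):
        if abs(left(m) - right(m)) > tol:
            return m
    return None


orders = {name: kink_order(l, r) for name, (l, r) in ACTIVATIONS.items()}
for name, m in orders.items():
    print(f"{name}: lowest derivative order with a jump at zero = {m}")
```

Under this reading, ReLU, LeakyReLU and SELU all have a jump in the first derivative, while ELU with alpha = 1 has a continuous first derivative and first jumps in the second derivative, so it would fall into a different class than the other three.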

What would settle it

Computing the RKHS or its eigenfunctions for LeakyReLU or ELU in a three-layer network versus a two-layer network and finding them inequivalent would falsify the depth-independent claim.

Original abstract

In recent years, the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks. However, the properties of these kernels are poorly understood for activation functions other than powers of the ReLU. Our main contribution is a characterization of the RKHS of these kernels for activation functions whose only non-smoothness is at zero. This extends existing theory to numerous commonly used activation functions such as SELU, ELU, or LeakyReLU. Additionally, we analyze a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, depending only on the degree of the non-smoothness up to equivalence. On the other hand, the RKHS generated by polynomial activations depends on the network depth. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper characterizes RKHS equivalence for activations non-smooth only at zero, showing depth independence tied to kink order, which extends ReLU results to ELU, SELU and LeakyReLU.

The main thing to know is that this paper gives a characterization of the RKHS for NNGP and NTK kernels when the activation is smooth except at zero. For those functions the space turns out to be the same at every depth and depends only on the order of the non-smoothness. That covers a lot of practical choices like ELU, SELU and LeakyReLU, and it is a direct extension of the ReLU-power cases that have dominated the literature so far. They also note that polynomial activations behave differently and keep depth dependence, plus they derive smoothness properties for the sample paths of the limiting Gaussian processes.

Referee Report

1 major / 1 minor

Summary. The paper characterizes the RKHS of NNGP and NTK kernels for activations whose only non-smoothness occurs at zero (e.g., SELU, ELU, LeakyReLU), extending prior ReLU-power results. It claims these activations yield depth-independent equivalent RKHSs determined solely by the degree of non-smoothness under infinite-width limits. Special cases (missing biases, two-layer nets, polynomials) are analyzed, with polynomials shown to produce depth-dependent RKHSs. Smoothness properties of NNGP sample paths are also derived.

Significance. If the central characterization holds, the work meaningfully extends neural kernel theory to a wide range of practical activations, clarifying when depth and activation details cease to affect the limiting RKHS. This could support more principled activation selection and initialization analysis in wide networks.

major comments (1)

[Main RKHS characterization (around the extension from ReLU powers)] The central claim that RKHS equivalence depends only on the degree of non-smoothness at zero appears to rest on the assumption that global activation asymptotics do not introduce depth-dependent terms in the kernel recursion E[ϕ(u)ϕ(v)]. However, activations like ELU (saturation for x ≪ 0) and SELU (exponential growth on one side) differ qualitatively from pure |x|^k models; without an explicit uniformity or domination argument controlling these tails under Gaussian pre-activations, the recursion may retain depth dependence. This is load-bearing for the equivalence result across depths.
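The expectation E[ϕ(u)ϕ(v)] at the centre of this comment can be probed numerically. The sketch below approximates it for unit-variance Gaussian pre-activations with correlation rho, using a plain midpoint rule, and checks the ReLU case against the known closed form (the order-1 arc-cosine kernel). The function name `gauss_expect`, the grid size and the truncation at |z| ≤ 8 are illustrative assumptions. The code does not reproduce the paper's argument and says nothing about RKHS equivalence. It only shows that the quantity stays computable for an activation with a saturating tail such as ELU.

```python
import math


def gauss_expect(phi, rho, limit=8.0, n=400):
    """Approximate E[phi(u) * phi(v)] for standard normal u, v with correlation rho.

    Midpoint rule on [-limit, limit]^2 in independent coordinates (z1, z2),
    with u = z1 and v = rho * z1 + sqrt(1 - rho^2) * z2.
    """
    h = 2.0 * limit / n
    nodes = [-limit + (i + 0.5) * h for i in range(n)]
    weights = [math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * h for z in nodes]
    c = math.sqrt(max(0.0, 1.0 - rho * rho))
    total = 0.0
    for z1, w1 in zip(nodes, weights):
        pu = phi(z1)
        if pu == 0.0:
            continue  # this row contributes nothing to the sum
        inner = 0.0
        for z2, w2 in zip(nodes, weights):
            inner += w2 * phi(rho * z1 + c * z2)
        total += w1 * pu * inner
    return total


def relu(x):
    return max(0.0, x)


def elu(x):
    return x if x > 0 else math.expm1(x)


def relu_closed_form(rho):
    """Order-1 arc-cosine kernel for unit-variance inputs with correlation rho."""
    theta = math.acos(rho)
    return (math.sin(theta) + (math.pi - theta) * rho) / (2.0 * math.pi)


rho = 0.5
approx = gauss_expect(relu, rho)
exact = relu_closed_form(rho)
print(f"ReLU: quadrature {approx:.5f}, closed form {exact:.5f}")
print(f"ELU:  quadrature {gauss_expect(elu, rho):.5f}")
```

A midpoint rule is used only to keep the sketch dependency-free. Gauss-Hermite quadrature would be the more usual choice if a numerical library is available.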

minor comments (1)

[Abstract] The abstract and introduction would benefit from a brief explicit statement of the precise regularity condition (e.g., C^∞ away from zero with controlled derivatives) used to justify the extension.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this important point about the potential role of activation tails in the kernel recursion. We address the comment below and will make a targeted revision to strengthen the argument.

Referee: The central claim that RKHS equivalence depends only on the degree of non-smoothness at zero appears to rest on the assumption that global activation asymptotics do not introduce depth-dependent terms in the kernel recursion E[ϕ(u)ϕ(v)]. However, activations like ELU (saturation for x ≪ 0) and SELU (exponential growth on one side) differ qualitatively from pure |x|^k models; without an explicit uniformity or domination argument controlling these tails under Gaussian pre-activations, the recursion may retain depth dependence. This is load-bearing for the equivalence result across depths.

Authors: We appreciate the referee highlighting the need for explicit control on tail contributions. Our proof decomposes the activation ϕ into a globally smooth part s and a part n that is non-smooth only at zero, with the RKHS equivalence determined by the singularity of n. The recursion for the covariance induced by s converges to a depth-independent limit because the pre-activation variance recursion reaches a fixed point for the activations considered (see the variance analysis preceding Theorem 3). The cross terms involving n are controlled by the local expansion near zero, which is independent of depth. Nevertheless, we agree that an explicit uniform domination bound on the tail integrals E[|s(u)s(v)| 1_{|u| or |v| large}] under the sequence of Gaussian measures would make the argument fully rigorous and address the concern directly. We will add a supporting lemma (with the required domination) in the revised version, placed in Section 3 or the appendix.

Revision: partial
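The variance recursion invoked in this response can be illustrated with a short numerical sketch. It iterates q_{l+1} = σ_w² E[ϕ(√q_l z)²] + σ_b² for z ~ N(0, 1), a common form of the NNGP diagonal recursion. That form, the choice σ_w² = 1 and σ_b² = 0, and the function names are all assumptions made here for illustration, and the paper's parameterization may differ.

```python
import math

SELU_ALPHA = 1.6732632423543772  # standard SELU constants
SELU_LAMBDA = 1.0507009873554805


def selu(x):
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * math.expm1(x))


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def second_moment(phi, q, limit=10.0, n=4000):
    """Midpoint-rule estimate of E[phi(sqrt(q) * z)^2] for z ~ N(0, 1)."""
    h = 2.0 * limit / n
    s = math.sqrt(q)
    total = 0.0
    for i in range(n):
        z = -limit + (i + 0.5) * h
        total += math.exp(-0.5 * z * z) * phi(s * z) ** 2
    return total * h / math.sqrt(2.0 * math.pi)


def variance_trajectory(phi, q0, depth, sigma_w2=1.0, sigma_b2=0.0):
    """Iterate q_{l+1} = sigma_w2 * E[phi(sqrt(q_l) z)^2] + sigma_b2."""
    qs = [q0]
    for _ in range(depth):
        qs.append(sigma_w2 * second_moment(phi, qs[-1]) + sigma_b2)
    return qs


selu_qs = variance_trajectory(selu, q0=1.0, depth=5)
leaky_qs = variance_trajectory(leaky_relu, q0=1.0, depth=5)
print("SELU:     ", [round(q, 4) for q in selu_qs])
print("LeakyReLU:", [round(q, 4) for q in leaky_qs])
```

With these assumed scalings, SELU stays close to q = 1, the self-normalizing point it was designed around. LeakyReLU with slope 0.01 instead shrinks by a factor of (1 + 0.01²)/2 per layer, so whether a nonzero fixed point exists depends on the weight and bias variances as well as on the activation.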

Circularity Check

0 steps flagged

Derivation is self-contained theoretical extension

The paper derives characterizations of the RKHS for neural kernels under infinite-width NNGP/NTK limits for activations whose only non-smoothness is at zero. This extends prior ReLU-power results via standard kernel recursion analysis without any reduction to self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The equivalence claim follows directly from the stated assumptions on local non-smoothness and Gaussian pre-activations, remaining independent of the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard infinite-width limit for fully connected networks and the assumption that activations are non-smooth only at zero. No free parameters, new entities, or ad-hoc axioms are indicated in the abstract.

axioms (2)

domain assumption: Infinite-width limit for fully connected neural networks
Standard background assumption in NTK and NNGP theory, invoked to obtain the kernel limits.
domain assumption: Activation functions have non-smoothness only at zero
Explicit premise used to extend the RKHS characterization beyond ReLU powers.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Theorem 9: for 1 ≤ s < ∞, H_{k^{NNGP}_L} ≅ H^{d/2+s+1/2}(S^d) (and an analogous NTK statement); smoothness(φ) is defined via lim_{t↘0} φ^{(m)}(t) ≠ lim_{t↗0} φ^{(m)}(t)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero unclear

Relation between the paper passage and the cited Recognition theorem.

Boundary analysis via Q_{α,β} classes and reference activations s_k(x) = (1/(2·k!)) sgn(x) x^k, leading to a Δ_m(φ)^2 t^{m+1/2} term
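As a worked illustration of the reference activations in this excerpt, the block below reads the prefactor as 1/(2·k!). This reading is an assumption, chosen because it normalizes the jump of the k-th derivative at zero to exactly 1. The LeakyReLU identity on the last line (with slope a on the negative side) is a hand-derived example of a "smooth part plus multiple of s_1" split and is not taken from the paper.

```latex
% Reference activation, assuming the prefactor is 1/(2 k!)
s_k(x) = \frac{\operatorname{sgn}(x)\, x^k}{2\, k!}, \qquad
s_k^{(m)}(x) = \frac{\operatorname{sgn}(x)\, x^{k-m}}{2\,(k-m)!}
\quad (0 \le m \le k,\ x \neq 0)

% Derivatives of order m < k are continuous at zero; the k-th derivative jumps by 1
\lim_{t \searrow 0} s_k^{(k)}(t) - \lim_{t \nearrow 0} s_k^{(k)}(t)
  = \tfrac{1}{2} - \left(-\tfrac{1}{2}\right) = 1

% Example: LeakyReLU with negative slope a, split into a linear (smooth) part and a multiple of s_1
\varphi(x) = \tfrac{1+a}{2}\, x + (1-a)\, s_1(x), \qquad s_1(x) = \tfrac{1}{2}\,|x|
```

Checking the last line on each side: for x > 0 the two terms sum to x, and for x < 0 they sum to a·x, which is LeakyReLU.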

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.