arxiv: 2506.22429 · v2 · submitted 2025-06-27 · 📊 stat.ML · cs.LG

Beyond ReLU: How Activations Affect Neural Kernels and Random Wide Networks

Pith reviewed 2026-05-19 07:38 UTC · model grok-4.3

keywords neural tangent kernel · NNGP · RKHS · activation functions · infinite-width networks · non-smooth activations · random wide networks

The pith

Activations that are non-smooth only at zero generate equivalent RKHSs for neural kernels across all depths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper characterizes the reproducing kernel Hilbert spaces (RKHSs) of neural tangent kernels (NTK) and neural network Gaussian process (NNGP) kernels for activation functions whose only non-smoothness is at zero. This extends prior results on powers of the ReLU to activations such as SELU, ELU, and LeakyReLU. For this broad class the RKHSs obtained at different network depths are equivalent, and the space is fixed by the degree of non-smoothness alone. Polynomial activations instead produce RKHSs that vary with depth. The work further derives the smoothness of NNGP sample paths, which describes the regularity of infinitely wide networks at initialization.

Core claim

For activations whose only non-smoothness is at zero, the RKHSs of the NTK and NNGP kernels at different network depths are equivalent, and the resulting space is determined by the degree of non-smoothness alone.

What carries the argument

RKHS characterization extending from ReLU powers to activations non-smooth solely at zero under the infinite-width limit.

If this is right

  • Equivalent RKHS across depths holds for SELU, ELU, LeakyReLU and similar activations.
  • Polynomial activations produce depth-dependent RKHS.
  • Smoothness of NNGP sample paths is determined by the activation.
  • Special cases such as missing biases or two-layer networks follow the same pattern.
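
As a hedged illustration of the polynomial bullet above (a worked example added here, not taken from the paper), consider the NNGP recursion for ϕ(x) = x² with unit weight variance, no biases, and unit-norm inputs with ρ = ⟨x, x'⟩. These conventions are assumptions chosen for concreteness. For centred jointly Gaussian (u, v), Isserlis' theorem gives E[u²v²] = Var(u)Var(v) + 2 Cov(u, v)², so the first two layers read:

```latex
\begin{align*}
K^{(1)}(\rho) &= \mathbb{E}[u^2 v^2] = 1 + 2\rho^2,
  \qquad (u,v) \sim \mathcal{N}\!\left(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right), \\
K^{(2)}(\rho) &= K^{(1)}(1)^2 + 2\,K^{(1)}(\rho)^2
  = 9 + 2\,(1 + 2\rho^2)^2
  = 11 + 8\rho^2 + 8\rho^4 .
\end{align*}
```

Under these assumptions the kernel after L hidden layers is a polynomial of degree 2^L in ρ, so for inputs on a sphere its RKHS is a finite-dimensional space of polynomials that grows with L. This is consistent with the depth dependence stated for polynomial activations.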

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Choosing among these activations may not change the effective function space as depth varies.
  • The results could guide the analysis of finite-width corrections or other kernel regimes.
  • The results may connect to smoothness properties studied in random feature models and approximation theory.

Load-bearing premise

Activation functions have their only non-smoothness at zero.
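
To make the premise concrete, the following minimal sketch estimates the lowest derivative order at which an activation jumps at zero, using one-sided finite differences. Reading the "degree of non-smoothness" as this order is an assumption made here for illustration; the paper may define it differently. The helper names (`one_sided_derivative`, `first_jump_order`), the step size, and the tolerance are hypothetical choices, and the SELU constants are the commonly published values.

```python
import math

# SELU constants as commonly published; treated here as given inputs.
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    # expm1 keeps precision for small negative x.
    return x if x > 0 else alpha * math.expm1(x)


def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)


def one_sided_derivative(f, order, side, h=1e-3):
    """Finite-difference estimate of the order-th derivative of f at 0.

    side=+1 samples only x >= 0 (forward difference),
    side=-1 samples only x <= 0 (backward difference).
    """
    total = 0.0
    for j in range(order + 1):
        total += (-1) ** (order - j) * math.comb(order, j) * f(side * j * h)
    return total / (side * h) ** order


def first_jump_order(f, max_order=3, tol=0.05):
    """Lowest order k <= max_order whose one-sided derivatives at 0 differ by more than tol."""
    for k in range(1, max_order + 1):
        left = one_sided_derivative(f, k, -1)
        right = one_sided_derivative(f, k, +1)
        if abs(left - right) > tol:
            return k
    return None  # no jump detected up to max_order


if __name__ == "__main__":
    for name, f in [("ReLU", relu), ("LeakyReLU", leaky_relu),
                    ("ELU", elu), ("SELU", selu), ("tanh", math.tanh)]:
        print(name, first_jump_order(f))
```

With these defaults the first derivative jumps for ReLU, LeakyReLU and SELU, while ELU with α = 1 is continuously differentiable at zero and first jumps in its second derivative. A finite-difference check of this kind only probes the origin; it says nothing about smoothness elsewhere, which the premise also requires.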

What would settle it

Computing the RKHS or its eigenfunctions for LeakyReLU or ELU in a three-layer network versus a two-layer network and finding them inequivalent would falsify the depth-independent claim.
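
A first ingredient of such a check is the kernel itself at each depth. The sketch below iterates the standard NNGP covariance recursion, K ← σ_w² E[ϕ(u)ϕ(v)] + σ_b², for two unit-norm inputs with inner product ρ, approximating the Gaussian expectation with a plain midpoint rule. The parameterization, the function names (`gauss_grid`, `gauss_expect`, `nngp`) and the grid settings are assumptions made for illustration and are not taken from the paper. It only produces kernel values; comparing RKHSs would additionally require the spectral decay of these kernels, which this sketch does not compute.

```python
import math


def gauss_grid(n=200, lim=8.0):
    """Midpoint nodes on [-lim, lim] with standard-normal quadrature weights."""
    h = 2.0 * lim / n
    nodes = [-lim + (i + 0.5) * h for i in range(n)]
    weights = [h * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in nodes]
    return nodes, weights


def gauss_expect(phi, q, rho, grid):
    """Approximate E[phi(u) phi(v)] for centred Gaussian (u, v) with
    Var(u) = Var(v) = q and correlation rho."""
    nodes, weights = grid
    s = math.sqrt(q)
    t = math.sqrt(max(0.0, 1.0 - rho * rho))
    total = 0.0
    for z1, w1 in zip(nodes, weights):
        pu = phi(s * z1)
        inner = 0.0
        for z2, w2 in zip(nodes, weights):
            # v = s * (rho * z1 + t * z2) has the required joint law with u = s * z1.
            inner += w2 * phi(s * (rho * z1 + t * z2))
        total += w1 * pu * inner
    return total


def nngp(phi, rho, hidden_layers, sigma_w2=1.0, sigma_b2=0.0, grid=None):
    """Return (variance, covariance) of the NNGP after the given number of
    hidden layers, for two unit-norm inputs with inner product rho."""
    grid = grid or gauss_grid()
    q = sigma_w2 + sigma_b2          # first pre-activation variance
    c = sigma_w2 * rho + sigma_b2    # first pre-activation covariance
    for _ in range(hidden_layers):
        r = max(-1.0, min(1.0, c / q))  # guard against quadrature round-off
        q_next = sigma_w2 * gauss_expect(phi, q, 1.0, grid) + sigma_b2
        c_next = sigma_w2 * gauss_expect(phi, q, r, grid) + sigma_b2
        q, c = q_next, c_next
    return q, c


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)


if __name__ == "__main__":
    # Kernel values for ELU at one and two hidden layers (no expected values asserted).
    for depth in (1, 2):
        print(depth, nngp(elu, 0.5, depth, sigma_b2=0.1))
```

For ReLU with one hidden layer and no bias, the covariance returned here can be compared against the known arc-cosine closed form (sin θ + (π − θ) cos θ) / (2π) with θ = arccos ρ, which gives a quick sanity check on the quadrature before moving to other activations.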

Original abstract

In recent years, the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks. However, the properties of these kernels are poorly understood for activation functions other than powers of the ReLU. Our main contribution is a characterization of the RKHS of these kernels for activation functions whose only non-smoothness is at zero. This extends existing theory to numerous commonly used activation functions such as SELU, ELU, or LeakyReLU. Additionally, we analyze a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, depending only on the degree of the non-smoothness up to equivalence. On the other hand, the RKHS generated by polynomial activations depends on the network depth. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper characterizes the RKHS of NNGP and NTK kernels for activations whose only non-smoothness occurs at zero (e.g., SELU, ELU, LeakyReLU), extending prior ReLU-power results. It claims these activations yield depth-independent equivalent RKHSs determined solely by the degree of non-smoothness under infinite-width limits. Special cases (missing biases, two-layer nets, polynomials) are analyzed, with polynomials shown to produce depth-dependent RKHSs. Smoothness properties of NNGP sample paths are also derived.

Significance. If the central characterization holds, the work meaningfully extends neural kernel theory to a wide range of practical activations, clarifying when depth and activation details cease to affect the limiting RKHS. This could support more principled activation selection and initialization analysis in wide networks.

major comments (1)
  1. [Main RKHS characterization (around the extension from ReLU powers)] The central claim that RKHS equivalence depends only on the degree of non-smoothness at zero appears to rest on the assumption that global activation asymptotics do not introduce depth-dependent terms in the kernel recursion E[ϕ(u)ϕ(v)]. However, activations like ELU and SELU (exponential saturation for x ≪ 0, linear growth for x ≫ 0) differ qualitatively from pure |x|^k models; without an explicit uniformity or domination argument controlling these tails under Gaussian pre-activations, the recursion may retain depth dependence. This is load-bearing for the equivalence result across depths.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a brief explicit statement of the precise regularity condition (e.g., C^∞ away from zero with controlled derivatives) used to justify the extension.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this important point about the potential role of activation tails in the kernel recursion. We address the comment below and will make a targeted revision to strengthen the argument.

Point-by-point responses
  1. Referee: The central claim that RKHS equivalence depends only on the degree of non-smoothness at zero appears to rest on the assumption that global activation asymptotics do not introduce depth-dependent terms in the kernel recursion E[ϕ(u)ϕ(v)]. However, activations like ELU and SELU (exponential saturation for x ≪ 0, linear growth for x ≫ 0) differ qualitatively from pure |x|^k models; without an explicit uniformity or domination argument controlling these tails under Gaussian pre-activations, the recursion may retain depth dependence. This is load-bearing for the equivalence result across depths.

    Authors: We appreciate the referee highlighting the need for explicit control on tail contributions. Our proof decomposes the activation ϕ into a globally smooth part s and a part n that is non-smooth only at zero, with the RKHS equivalence determined by the singularity of n. The recursion for the covariance induced by s converges to a depth-independent limit because the pre-activation variance recursion reaches a fixed point for the activations considered (see the variance analysis preceding Theorem 3). The cross terms involving n are controlled by the local expansion near zero, which is independent of depth. Nevertheless, we agree that an explicit uniform domination bound on the tail integrals E[|s(u)s(v)| 1_{|u| or |v| large}] under the sequence of Gaussian measures would make the argument fully rigorous and address the concern directly. We will add a supporting lemma (with the required domination) in the revised version, placed in Section 3 or the appendix.

    Revision: partial
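
To illustrate the kind of split ϕ = s + n that the simulated rebuttal describes, here are two hand-worked decompositions, for LeakyReLU with negative slope a and for ELU with α = 1. These are examples constructed here under the rebuttal's description; such splits are not unique, and the paper may use different ones.

```latex
\begin{align*}
\mathrm{LeakyReLU}_a(x) &= \underbrace{a\,x}_{s(x)} + \underbrace{(1-a)\max(0,x)}_{n(x)}, \\
\mathrm{ELU}(x) &= \underbrace{e^{x}-1}_{s(x)} + \underbrace{\bigl(x + 1 - e^{x}\bigr)\,\mathbf{1}_{x>0}}_{n(x)},
\qquad n(x) = -\tfrac{x^{2}}{2}\,\mathbf{1}_{x>0} + O(x^{3}) \quad \text{as } x \to 0 .
\end{align*}
```

In the first case n is a multiple of ReLU. In the second, n behaves like a multiple of ReLU² near zero, but this particular smooth part s grows exponentially as x → +∞, which is the type of tail behaviour the referee asks to have controlled under the Gaussian measures.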

Circularity Check

0 steps flagged

Derivation is self-contained theoretical extension

The paper derives characterizations of the RKHS for neural kernels under infinite-width NNGP/NTK limits for activations whose only non-smoothness is at zero. This extends prior ReLU-power results via standard kernel recursion analysis without any reduction to self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The equivalence claim follows directly from the stated assumptions on local non-smoothness and Gaussian pre-activations, remaining independent of the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard infinite-width limit for fully connected networks and the assumption that activations are non-smooth only at zero. No free parameters, new entities, or ad-hoc axioms are indicated in the abstract.

axioms (2)
  • domain assumption: Infinite-width limit for fully connected neural networks
    Standard background assumption in NTK and NNGP theory invoked to obtain the kernel limits.
  • domain assumption: Activation functions have non-smoothness only at zero
    Explicit premise used to extend the RKHS characterization beyond ReLU powers.

pith-pipeline@v0.9.0 · 5719 in / 1332 out tokens · 37711 ms · 2026-05-19T07:38:15.636847+00:00 · methodology

