Beyond ReLU: How Activations Affect Neural Kernels and Random Wide Networks
Pith reviewed 2026-05-19 07:38 UTC · model grok-4.3
The pith
Activations non-smooth only at zero generate equivalent RKHS for neural kernels across all depths.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For activations whose only non-smoothness is at zero, the NTK and NNGP kernels generate equivalent RKHSs at different network depths; up to equivalence, the RKHS depends only on the degree of non-smoothness.
What carries the argument
RKHS characterization extending from ReLU powers to activations non-smooth solely at zero under the infinite-width limit.
If this is right
- Equivalent RKHS across depths holds for SELU, ELU, LeakyReLU and similar activations.
- Polynomial activations produce depth-dependent RKHS.
- Smoothness of NNGP sample paths is determined by the activation.
- Special cases such as missing biases or two-layer networks follow the same pattern.
Where Pith is reading between the lines
- Choice of these activations may not change the effective function space when depth varies.
- Results could guide analysis of finite-width corrections or other kernel regimes.
- The results appear to connect to smoothness properties in random feature models or approximation theory.
Load-bearing premise
Activation functions have their only non-smoothness at zero.
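One way to make the "degree of non-smoothness" concrete is to look for the lowest derivative order at which the one-sided limits at zero disagree. The sketch below is a rough numerical illustration in plain Python, not the paper's method. The helper names (`forward_difference`, `one_sided_derivatives`, `first_jump_order`), the step size, the tolerance, the LeakyReLU slope of 0.01, and ELU with α = 1 are all assumptions chosen for illustration.

```python
import math

# Standard SELU constants; all helper names below are hypothetical.
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)


def selu(x):
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * math.expm1(x))


def forward_difference(f, order, h):
    """Approximate the right-sided derivative of the given order at 0."""
    total = sum(
        (-1) ** (order - j) * math.comb(order, j) * f(j * h)
        for j in range(order + 1)
    )
    return total / h ** order


def one_sided_derivatives(f, order, h=1e-3):
    """Return (left, right) estimates of the order-th derivative at 0."""
    right = forward_difference(f, order, h)
    # The left derivative of f equals (-1)^order times the right
    # derivative of the reflected function t -> f(-t).
    left = (-1) ** order * forward_difference(lambda t: f(-t), order, h)
    return left, right


def first_jump_order(f, max_order=3, tol=0.05):
    """Smallest order m <= max_order whose one-sided derivatives differ."""
    for m in range(1, max_order + 1):
        left, right = one_sided_derivatives(f, m)
        if abs(left - right) > tol:
            return m
    return None  # no jump detected up to max_order


if __name__ == "__main__":
    for name, f in [("LeakyReLU", leaky_relu), ("ELU", elu),
                    ("SELU", selu), ("tanh", math.tanh)]:
        print(name, first_jump_order(f))
```

With these assumed settings, LeakyReLU and SELU show a jump already in the first derivative, while ELU with α = 1 is continuously differentiable at zero and first jumps in the second derivative. tanh is included only as a smooth contrast and reports no jump.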
What would settle it
Computing the RKHS or its eigenfunctions for LeakyReLU or ELU in a three-layer network versus a two-layer network and finding them inequivalent would falsify the depth-independent claim.
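A rough numerical proxy for such a check is to compare how fast kernel eigenvalues decay at different depths: for dot-product kernels on a sphere the eigenfunctions are shared, so the RKHSs are equivalent roughly when the eigenvalues are comparable up to constants. The sketch below does this on the circle for LeakyReLU and is an illustration under stated assumptions, not the paper's procedure. It assumes no biases, a weight variance of 2/(1 + slope²) so that unit-norm inputs keep unit pre-activation variance, and the closed-form arc-cosine expression for the ReLU expectation. All function names are hypothetical, and `hidden_layers` simply counts applications of the activation.

```python
import math


def relu_arccos(rho):
    """E[relu(u) relu(v)] for unit-variance Gaussians with correlation rho."""
    rho = max(-1.0, min(1.0, rho))  # guard against rounding outside [-1, 1]
    return (math.sqrt(1.0 - rho * rho)
            + (math.pi - math.acos(rho)) * rho) / (2.0 * math.pi)


def leaky_correlation_map(rho, slope=0.1):
    """One layer of the normalised NNGP correlation recursion for LeakyReLU."""
    # LeakyReLU(x) = relu(x) - slope * relu(-x); expanding the product gives
    # (1 + slope^2) * kappa(rho) - 2 * slope * kappa(-rho).
    dual = ((1.0 + slope ** 2) * relu_arccos(rho)
            - 2.0 * slope * relu_arccos(-rho))
    return 2.0 * dual / (1.0 + slope ** 2)  # assumed weight variance


def nngp_kernel(theta, hidden_layers, slope=0.1):
    """Kernel value for two unit-norm inputs separated by angle theta."""
    rho = math.cos(theta)
    for _ in range(hidden_layers):
        rho = leaky_correlation_map(rho, slope)
    return rho


def cosine_coefficients(hidden_layers, max_freq=16, grid=4000, slope=0.1):
    """Trapezoid estimates of (1/pi) * integral_0^pi k(theta) cos(n theta)."""
    h = math.pi / grid
    values = [nngp_kernel(i * h, hidden_layers, slope) for i in range(grid + 1)]
    coeffs = []
    for n in range(max_freq + 1):
        total = 0.0
        for i, k in enumerate(values):
            weight = 0.5 if i in (0, grid) else 1.0
            total += weight * k * math.cos(n * i * h)
        coeffs.append(total * h / math.pi)
    return coeffs


if __name__ == "__main__":
    for depth in (2, 3):
        coeffs = cosine_coefficients(depth)
        print(depth, [f"{c:.2e}" for c in coeffs[1:9]])
```

Coefficient ratios across depths that stay bounded away from zero and infinity would be consistent with the equivalence claim; a finite grid cannot prove or refute it. The abstract lists missing biases and two-layer networks among the special cases the paper treats separately, so results from this bias-free setup should be read with that in mind.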
Original abstract
In recent years, the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks. However, the properties of these kernels are poorly understood for activation functions other than powers of the ReLU. Our main contribution is a characterization of the RKHS of these kernels for activation functions whose only non-smoothness is at zero. This extends existing theory to numerous commonly used activation functions such as SELU, ELU, or LeakyReLU. Additionally, we analyze a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, depending only on the degree of the non-smoothness up to equivalence. On the other hand, the RKHS generated by polynomial activations depends on the network depth. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper characterizes the RKHS of NNGP and NTK kernels for activations whose only non-smoothness occurs at zero (e.g., SELU, ELU, LeakyReLU), extending prior ReLU-power results. It claims these activations yield depth-independent equivalent RKHSs determined solely by the degree of non-smoothness under infinite-width limits. Special cases (missing biases, two-layer nets, polynomials) are analyzed, with polynomials shown to produce depth-dependent RKHSs. Smoothness properties of NNGP sample paths are also derived.
Significance. If the central characterization holds, the work meaningfully extends neural kernel theory to a wide range of practical activations, clarifying when depth and activation details cease to affect the limiting RKHS. This could support more principled activation selection and initialization analysis in wide networks.
major comments (1)
- [Main RKHS characterization (around the extension from ReLU powers)] The central claim that RKHS equivalence depends only on the degree of non-smoothness at zero appears to rest on the assumption that global activation asymptotics do not introduce depth-dependent terms in the kernel recursion E[ϕ(u)ϕ(v)]. However, activations like ELU and SELU (exponential saturation for x ≪ 0, linear growth for x > 0) differ qualitatively from pure |x|^k models; without an explicit uniformity or domination argument controlling these tails under Gaussian pre-activations, the recursion may retain depth dependence. This is load-bearing for the equivalence result across depths.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from a brief explicit statement of the precise regularity condition (e.g., C^∞ away from zero with controlled derivatives) used to justify the extension.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for identifying this important point about the potential role of activation tails in the kernel recursion. We address the comment below and will make a targeted revision to strengthen the argument.
Point-by-point responses
Referee: The central claim that RKHS equivalence depends only on the degree of non-smoothness at zero appears to rest on the assumption that global activation asymptotics do not introduce depth-dependent terms in the kernel recursion E[ϕ(u)ϕ(v)]. However, activations like ELU and SELU (exponential saturation for x ≪ 0, linear growth for x > 0) differ qualitatively from pure |x|^k models; without an explicit uniformity or domination argument controlling these tails under Gaussian pre-activations, the recursion may retain depth dependence. This is load-bearing for the equivalence result across depths.
Authors: We appreciate the referee highlighting the need for explicit control on tail contributions. Our proof decomposes the activation ϕ into a globally smooth part s and a part n that is non-smooth only at zero, with the RKHS equivalence determined by the singularity of n. The recursion for the covariance induced by s converges to a depth-independent limit because the pre-activation variance recursion reaches a fixed point for the activations considered (see the variance analysis preceding Theorem 3). The cross terms involving n are controlled by the local expansion near zero, which is independent of depth. Nevertheless, we agree that an explicit uniform domination bound on the tail integrals E[|s(u)s(v)| 1_{|u| or |v| large}] under the sequence of Gaussian measures would make the argument fully rigorous and address the concern directly. We will add a supporting lemma (with the required domination) in the revised version, placed in Section 3 or the appendix.
revision: partial
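The response leans on the pre-activation variance recursion reaching a fixed point. A standard form of that recursion is q_{l+1} = σ_w² E[ϕ(√q_l z)²] + σ_b² with z ~ N(0, 1), and the sketch below iterates it numerically for SELU. It is a minimal illustration under assumptions, not the authors' argument: σ_w² = 1 and σ_b² = 0 are assumed, the expectation is estimated with a plain trapezoid rule, and the names `second_moment` and `variance_recursion` are hypothetical.

```python
import math

# Standard SELU constants; helper names and parameters are illustrative.
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x):
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * math.expm1(x))


def second_moment(phi, q, half_width=10.0, steps=4000):
    """Trapezoid estimate of E[phi(sqrt(q) * z)^2] for z ~ N(0, 1)."""
    h = 2.0 * half_width / steps
    scale = math.sqrt(q)
    norm = math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-0.5 * z * z) / norm
        total += weight * phi(scale * z) ** 2 * density
    return total * h


def variance_recursion(phi, q0, depth, sigma_w2=1.0, sigma_b2=0.0):
    """Iterate q -> sigma_w2 * E[phi(sqrt(q) z)^2] + sigma_b2 for `depth` layers."""
    qs = [q0]
    for _ in range(depth):
        qs.append(sigma_w2 * second_moment(phi, qs[-1]) + sigma_b2)
    return qs


if __name__ == "__main__":
    qs = variance_recursion(selu, q0=2.0, depth=60)
    print(qs[0], qs[1], qs[-1])
```

With these assumed settings the iterates started at q = 2 should settle near q = 1, the variance SELU was designed to preserve. Other activations or weight scales may behave differently, which is one reason the explicit tail control requested by the referee matters.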
Circularity Check
Derivation is self-contained theoretical extension
The paper derives characterizations of the RKHS for neural kernels under infinite-width NNGP/NTK limits for activations whose only non-smoothness is at zero. This extends prior ReLU-power results via standard kernel recursion analysis without any reduction to self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The equivalence claim follows directly from the stated assumptions on local non-smoothness and Gaussian pre-activations, remaining independent of the target result.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Infinite-width limit for fully connected neural networks
- domain assumption: Activation functions have non-smoothness only at zero
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Theorem 9: for 1 ≤ s < ∞, H_{k^{NNGP}_L} ≅ H^{d/2+s+1/2}(S^d) (and analogous NTK statement); smoothness(φ) defined via lim_{t↘0} φ^{(m)}(t) ≠ lim_{t↗0} φ^{(m)}(t)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, costAlphaLog_fourth_deriv_at_zero (unclear)
  Boundary analysis via Q_{α,β} classes and reference activations s_k(x) = (1/2k!) sgn(x) x^k leading to Δ_m(φ)^2 t^{m+1/2} term
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.