Fast generalization error bound of deep learning without scale invariance of activation functions

Ryoma Hirose; Yoshikazu Terada

arxiv: 1907.10900 · v1 · pith:GFIN725Enew · submitted 2019-07-25 · 📊 stat.ML · cs.LG


Pith reviewed 2026-05-24 16:13 UTC · model grok-4.3

keywords: deep neural networks · generalization error bounds · activation functions · fast learning rates · scale invariance · statistical learning theory

The pith

Deep neural networks achieve fast generalization bounds without scale-invariant activation functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the tight, fast generalization error bound for deep neural networks can be derived without assuming scale invariance of the activation functions. Using the analysis framework of Suzuki (2018), it obtains essentially the same bound for activations that lack this property, such as the sigmoid, the hyperbolic tangent, and the exponential linear unit (ELU). This matters because it removes a condition that had appeared to restrict which networks could be shown to converge faster than the usual $O(1/\sqrt{n})$ rate, where $n$ is the sample size. The result shows that the framework applies more broadly to general activation functions.

Core claim

The paper derives a tight generalization error bound without assuming scale invariance of the activation functions. The bound is essentially the same as the one obtained under that assumption, which shows that scale invariance is not essential for the fast rate of convergence in this analysis framework.

What carries the argument

The generalization error analysis framework that produces tight bounds, applied directly to deep networks whose activations lack scale invariance.

If this is right

The fast convergence rate applies to networks that use common non-invariant activations such as sigmoid, tanh, and ELU.
The analysis framework extends to deep learning models with a wider range of activation functions.
Scale invariance is not required to reach the improved rate in this setting.
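
The distinction between invariant and non-invariant activations can be checked numerically. The sketch below assumes that "scale invariance" means positive homogeneity, i.e. $\sigma(cx) = c\,\sigma(x)$ for every $c > 0$; that reading is an assumption here, since this page does not state the definition. The helper `is_scale_invariant`, the test grid, and the tolerance are all hypothetical choices made for illustration.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def elu(x, alpha=1.0):
    # ELU with the common default alpha = 1 (an assumption for this sketch)
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def is_scale_invariant(f, xs=(-2.0, -0.5, 0.0, 0.5, 2.0), cs=(0.5, 2.0, 3.0), tol=1e-9):
    """Check f(c * x) == c * f(x) on a small grid of inputs x and scales c > 0.

    Passing on a finite grid is only evidence of the property, not a proof;
    failing on the grid does disprove it.
    """
    return all(abs(f(c * x) - c * f(x)) <= tol for x in xs for c in cs)

activations = [("relu", relu), ("sigmoid", sigmoid), ("tanh", math.tanh), ("elu", elu)]
results = {name: is_scale_invariant(f) for name, f in activations}
print(results)
```

Under that assumed definition, only ReLU passes the check, which matches the abstract's statement that the sigmoid, tanh, and ELU do not satisfy the condition.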

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finding opens the possibility that other restrictive assumptions in generalization analyses could also be relaxed while preserving the fast rate.
Empirical checks of the predicted rate on networks with non-invariant activations would provide a direct test of the bound.
The result suggests that activation choice may not be a limiting factor for theoretical fast rates in this framework.
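
One way to set up the empirical check mentioned above is to measure excess error at several sample sizes and fit the slope of log(error) against log(n); a slope near -0.5 corresponds to the $O(1/\sqrt{n})$ rate, and a steeper slope to a faster rate. The sketch below is a minimal illustration of that fit only. The function name `fit_rate_exponent` is hypothetical, and the two error curves are synthetic placeholders (the $1/n$ curve is just an example of "faster", not the rate the paper proves), not measurements from any network.

```python
import math

def fit_rate_exponent(ns, errors):
    """Least-squares slope of log(error) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic placeholder curves, NOT results from the paper.
ns = [100, 400, 1600, 6400]
slow_errors = [1.0 / math.sqrt(n) for n in ns]  # decays like n^(-1/2)
fast_errors = [1.0 / n for n in ns]             # decays like n^(-1)

slope_slow = fit_rate_exponent(ns, slow_errors)
slope_fast = fit_rate_exponent(ns, fast_errors)
print(slope_slow, slope_fast)
```

In a real experiment the error lists would come from trained networks with sigmoid, tanh, or ELU activations, and the fitted slope would only be suggestive, since an upper bound on the rate does not pin down the observed decay exactly.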

Load-bearing premise

The existing framework for analyzing generalization error remains valid and can be applied without change to activation functions that lack scale invariance.

What would settle it

An empirical or theoretical demonstration that deep networks using sigmoid or hyperbolic tangent activations converge only at the slower $O(1/\sqrt{n})$ rate, in a setting where the bound predicts a faster rate, would falsify the claim.

Original abstract

In theoretical analysis of deep learning, discovering which features of deep learning lead to good performance is an important task. In this paper, using the framework for analyzing the generalization error developed in Suzuki (2018), we derive a fast learning rate for deep neural networks with more general activation functions. In Suzuki (2018), assuming the scale invariance of activation functions, the tight generalization error bound of deep learning was derived. They mention that the scale invariance of the activation function is essential to derive tight error bounds. Whereas the rectified linear unit (ReLU; Nair and Hinton, 2010) satisfies the scale invariance, the other famous activation functions including the sigmoid and the hyperbolic tangent functions, and the exponential linear unit (ELU; Clevert et al., 2016) does not satisfy this condition. The existing analysis indicates a possibility that a deep learning with the non scale invariant activations may have a slower convergence rate of $O(1/\sqrt{n})$ when one with the scale invariant activations can reach a rate faster than $O(1/\sqrt{n})$. In this paper, without the scale invariance of activation functions, we derive the tight generalization error bound which is essentially the same as that of Suzuki (2018). From this result, at least in the framework of Suzuki (2018), it is shown that the scale invariance of the activation functions is not essential to get the fast rate of convergence. Simultaneously, it is also shown that the theoretical framework proposed by Suzuki (2018) can be widely applied for analysis of deep learning with general activation functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper removes the scale-invariance assumption from Suzuki (2018) and recovers the same fast rate for activations like sigmoid and tanh.

The main result is that the fast generalization bound from Suzuki (2018) still holds once the scale-invariance requirement on activations is dropped. The authors adapt the framework to cover non-invariant functions such as sigmoid, tanh, and ELU while keeping the rate essentially unchanged. This directly addresses the limitation noted in the earlier paper, where scale invariance was presented as essential. The extension is the concrete new piece: the same rate now applies more broadly without that restriction.

The work is useful because it widens the reach of the Suzuki analysis to activations that are common in practice. The citation pattern is straightforward and builds cleanly on the referenced framework and activation-function papers. The central claim rests on the modifications to the proof being valid, and the abstract indicates they succeeded without introducing new restrictions that would slow the rate. Any soft spot is in the details of those modifications; the abstract alone does not show the lemmas, so a referee would need to verify that the adaptation preserves the fast rate without hidden steps. No internal contradictions appear from the description.

This paper is for readers working on statistical learning theory for deep networks, especially those who want to know which assumptions are truly load-bearing. It deserves a serious referee because it fills a stated gap with a targeted extension rather than a broad new theory.

Referee Report

0 major / 1 minor

Summary. The paper extends the generalization error analysis framework of Suzuki (2018) to deep neural networks with activation functions that lack scale invariance (e.g., sigmoid, tanh, ELU). It derives a tight generalization error bound that is essentially the same as in Suzuki (2018), concluding that scale invariance is not essential for the fast rate of convergence within this framework and that the Suzuki framework applies more broadly to general activations.

Significance. If the derivation holds, the result broadens the applicability of fast-rate bounds to activation functions commonly used in practice, removing a potential restriction from Suzuki (2018) and confirming the framework's versatility. This addresses a gap between theoretical assumptions and empirical deep learning.

minor comments (1)

[Abstract] The sentence stating that 'the other famous activation functions including the sigmoid and the hyperbolic tangent functions, and the exponential linear unit (ELU; Clevert et al., 2016) does not satisfy this condition' contains a subject-verb agreement error ('functions' is plural, so it should read 'do not').

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We are glad that the extension of the Suzuki (2018) framework to non-scale-invariant activations is viewed as addressing a gap between theory and practice.

Circularity Check

0 steps flagged

No significant circularity

The paper explicitly extends the independent Suzuki (2018) framework by removing its scale-invariance assumption on activations and re-derives an equivalent generalization bound. This is a standard modification of prior external work rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation. No equations or steps in the abstract reduce the claimed result to its own inputs by construction; the derivation is presented as building on an externally developed analysis that remains valid after the modification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the validity of extending the Suzuki (2018) framework to non-scale-invariant activations; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption: The generalization error analysis framework from Suzuki (2018) applies without the scale invariance assumption on activations.
Basis: the paper states that it uses this framework to derive the bound for general activations.

pith-pipeline@v0.9.0 · 5816 in / 1151 out tokens · 27266 ms · 2026-05-24T16:13:55.323357+00:00 · methodology
