pith. sign in

arxiv: 2604.07715 · v1 · submitted 2026-04-09 · 💻 cs.LG · math.OC

Mathematical analysis of one-layer neural network with fixed biases, a new activation function and other observations

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords one-layer neural networkReLUgradient descentconvergencespectral biasactivation functionFReXL2 loss
0
0 comments X

The pith

One-hidden-layer network with fixed biases converges under gradient descent on L2 loss and shows spectral bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper rigorously proves convergence of the training process for a one-hidden-layer neural network that uses ReLU activations, has fixed biases, and takes scalar inputs and outputs. It also shows that gradient descent on the squared L2 loss exhibits spectral bias, so that lower-frequency components of the target function are learned earlier. The authors use the analysis to identify desirable properties of activation functions and introduce the full-wave rectified exponential function (FReX) as a candidate that satisfies those properties while preserving provable convergence.

Core claim

For the continuous and discrete versions of this one-layer model the gradient-descent flow on the L2 squared loss converges to a global minimizer; moreover the dynamics are governed by the spectrum of certain integral operators induced by the activation, which produces the observed spectral bias. The same operator analysis yields necessary conditions on the activation function and supports the introduction of FReX, for which convergence is likewise proved.

What carries the argument

The one-hidden-layer network with fixed biases and ReLU (or FReX) activation whose training dynamics reduce to gradient flow on a loss whose Hessian spectrum encodes both convergence and frequency bias.

If this is right

  • The parameters converge to values that globally minimize the L2 loss for any continuous target function.
  • Lower-frequency Fourier modes of the target are recovered first during training.
  • Activation functions must satisfy spectral conditions derived from the associated integral operators to guarantee convergence.
  • The proposed FReX activation inherits the same convergence guarantees as ReLU.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same operator-spectrum approach might be applied to other simple architectures to predict their bias toward smooth or low-frequency solutions.
  • If spectral bias persists when biases are allowed to train, the result would strengthen the claim that the phenomenon is intrinsic to gradient descent rather than an artifact of fixed biases.
  • Practical tests of FReX on low-dimensional regression tasks could check whether the theoretical convergence advantage translates to faster or more stable training.

Load-bearing premise

The entire analysis is carried out only for scalar input and output with all biases held fixed, which removes many degrees of freedom that are present in typical neural networks.

What would settle it

A numerical run of gradient descent on the exact one-dimensional model that either diverges or fails to learn low frequencies before high frequencies would falsify the convergence and spectral-bias claims.

read the original abstract

We analyze a simple one-hidden-layer neural network with ReLU activation functions and fixed biases, with one-dimensional input and output. We study both continuous and discrete versions of the model, and we rigorously prove the convergence of the learning process with the $L^2$ squared loss function and the gradient descent procedure. We also prove the spectral bias property for this learning process. Several conclusions of this analysis are discussed; in particular, regarding the structure and properties that activation functions should possess, as well as the relationships between the spectrum of certain operators and the learning process. Based on this, we also propose an alternative activation function, the full-wave rectified exponential function (FReX), and we discuss the convergence of the gradient descent with this alternative activation function.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript analyzes a one-hidden-layer neural network with ReLU activations, fixed biases, and one-dimensional input/output. It rigorously proves convergence of both the continuous gradient flow and discrete gradient descent under squared L² loss, establishes the spectral bias property via the spectrum of an associated integral operator, and proposes the full-wave rectified exponential (FReX) activation function while discussing its convergence under the same training procedure.

Significance. If the derivations hold, the work supplies a concrete, fully-scoped mathematical treatment of gradient-descent dynamics and spectral bias for a deliberately simplified model. The explicit proofs for both continuous and discrete cases, together with the operator-theoretic framing of spectral bias and the analysis of a new activation, constitute a clear strength. Such results can serve as a reference point for understanding why spectral bias appears in practice and for guiding the design of activation functions, even though the setting is restricted to 1-D fixed-bias networks.

minor comments (3)
  1. [Model definition] The model definition (early sections) introduces the network with fixed biases but does not explicitly state the precise function space in which the weights live; adding a short sentence clarifying that the weights are real scalars (or vectors in the 1-D case) would remove any ambiguity for readers.
  2. [Discrete GD convergence] In the convergence proof for discrete gradient descent, the step-size restriction is stated in terms of a generic Lipschitz constant; an explicit upper bound derived from the network parameters would make the result more immediately usable.
  3. [FReX proposal] The FReX activation is defined and its convergence is discussed, yet no plot or numerical comparison with ReLU on a simple target function is provided; a single illustrative figure would strengthen the claim that FReX is a viable alternative.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, including the rigorous proofs for gradient flow and discrete gradient descent convergence, the spectral bias analysis via the integral operator, and the proposal of the FReX activation function. The recommendation for minor revision is noted. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained mathematical proofs

full rationale

The paper restricts itself to a concrete 1D one-hidden-layer model with fixed biases and ReLU (or FReX). Central results are explicit proofs of convergence for continuous/discrete gradient descent under L2 loss and of spectral bias, obtained directly from the gradient-flow ODEs and the spectrum of the associated integral operator. No parameters are fitted to data and then relabeled as predictions, no self-definitional loops appear in the activation or loss definitions, and no load-bearing uniqueness theorem is imported from the authors' prior work. Any self-citations are peripheral and do not substitute for the derivations. The analysis therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on standard properties of ReLU, gradient descent convergence in suitable loss landscapes, and the definition of a new activation; no free parameters are fitted to data in the stated results.

axioms (2)
  • domain assumption ReLU satisfies the standard piecewise-linear properties used in convergence arguments for gradient descent
    Invoked throughout the model definition and proof sketches in the abstract.
  • domain assumption The loss landscape for the squared L2 loss on this model permits global convergence of gradient descent under the stated conditions
    Central to the convergence claim.
invented entities (1)
  • FReX (full-wave rectified exponential) activation function no independent evidence
    purpose: Alternative activation proposed to possess desirable structural properties identified in the analysis
    Defined and analyzed in the paper; no independent empirical or theoretical validation outside this work is mentioned.

pith-pipeline@v0.9.0 · 5428 in / 1511 out tokens · 33841 ms · 2026-05-10T17:50:14.922353+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

  1. [1]

    Brown, Rainald L¨ ohner, Fumiya To- gashi, and Deepanshu Verma

    [ABL+24] Harbir Antil, Thomas S. Brown, Rainald L¨ ohner, Fumiya To- gashi, and Deepanshu Verma. Deep neural nets with fixed bias configuration.Numerical Algebra, Control and Optimization, 14(1):20–33, 2024. [CFW+21] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quan- quan Gu. Towards understanding the spectral bias of deep learn- ing. InProceedings...

  2. [2]

    On the spectral bias of neural networks

    Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In Ka- malika Chaudhuri and Ruslan Salakhutdinov, editors,Proceed- ings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5301–5310. PMLR, 2019. [RS80] Michael Reed and Barry Sim...