Mathematical analysis of one-layer neural network with fixed biases, a new activation function and other observations

Fabricio Macià; Shu Nakamura

arxiv: 2604.07715 · v1 · submitted 2026-04-09 · 💻 cs.LG · math.OC

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords: one-layer neural network, ReLU, gradient descent, convergence, spectral bias, activation function, FReX, L2 loss

The pith

A one-hidden-layer network with fixed biases converges under gradient descent on the L2 loss and shows spectral bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper rigorously proves convergence of the training process for a one-hidden-layer neural network that uses ReLU activations, has fixed biases, and takes scalar inputs and outputs. It also shows that gradient descent on the squared L2 loss exhibits spectral bias, so that lower-frequency components of the target function are learned earlier. The authors use the analysis to identify desirable properties of activation functions, introduce the full-wave rectified exponential function (FReX) as a candidate, and discuss the convergence of gradient descent with it.

Core claim

For both the continuous and the discrete version of this one-layer model, gradient descent on the squared L2 loss converges to a global minimizer. The dynamics are governed by the spectrum of certain integral operators induced by the activation, and this is what produces the observed spectral bias. The same operator analysis suggests properties an activation function should have and motivates FReX, whose convergence under gradient descent is then discussed.
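To make the role of the spectrum concrete, here is a minimal worked sketch under an assumption this page does not spell out: only the output weights $a_j$ of $f_a(x)=\sum_j a_j\,\mathrm{ReLU}(x-b_j)$ are trained while the biases $b_j$ stay fixed, so the loss is quadratic in $a$. The symbols $a$, $G$, $c$, $g$, $\lambda_k$, $v_k$ are illustrative and not taken from the paper, and $G$ is assumed positive definite.

```latex
% Assumed model (illustrative): f_a(x) = \sum_{j=1}^{m} a_j \,\mathrm{ReLU}(x - b_j),
% with fixed biases b_j and target function g.
\begin{align*}
L(a) &= \tfrac12 \int \bigl(f_a(x) - g(x)\bigr)^2\,dx
      = \tfrac12\, a^\top G a - c^\top a + \tfrac12 \|g\|_{L^2}^2,\\
G_{jk} &= \int \mathrm{ReLU}(x-b_j)\,\mathrm{ReLU}(x-b_k)\,dx,
\qquad c_j = \int \mathrm{ReLU}(x-b_j)\,g(x)\,dx,\\
\dot a(t) &= -\nabla L\bigl(a(t)\bigr) = -\bigl(G\,a(t) - c\bigr),\\
a(t) - a^\ast &= \sum_k e^{-\lambda_k t}\,\langle a(0) - a^\ast, v_k\rangle\, v_k,
\qquad G v_k = \lambda_k v_k,\quad G a^\ast = c .
\end{align*}
```

In this sketch every eigen-direction of $G$ relaxes independently at rate $\lambda_k$, so directions with large eigenvalues are fitted first. Which eigen-directions correspond to low frequencies is exactly the kind of statement the paper's spectral analysis is said to establish.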

What carries the argument

The one-hidden-layer network with fixed biases and ReLU (or FReX) activation whose training dynamics reduce to gradient flow on a loss whose Hessian spectrum encodes both convergence and frequency bias.

If this is right

The parameters converge to values that globally minimize the L2 loss for any continuous target function.
Lower-frequency Fourier modes of the target are recovered first during training.
Activation functions must satisfy spectral conditions derived from the associated integral operators to guarantee convergence.
The proposed FReX activation admits a convergence analysis along the same lines as ReLU.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same operator-spectrum approach might be applied to other simple architectures to predict their bias toward smooth or low-frequency solutions.
If spectral bias persists when biases are allowed to train, the result would strengthen the claim that the phenomenon is intrinsic to gradient descent rather than an artifact of fixed biases.
Practical tests of FReX on low-dimensional regression tasks could check whether the theoretical convergence advantage translates to faster or more stable training.

Load-bearing premise

The entire analysis is carried out only for scalar input and output with all biases held fixed, which removes many degrees of freedom that are present in typical neural networks.

What would settle it

A numerical run of gradient descent on the exact one-dimensional model that either diverges or fails to learn low frequencies before high frequencies would falsify the convergence and spectral-bias claims.
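A rough version of such a run can be sketched in a few lines. The setup below is an assumption, not the paper's exact model: inputs on [0, 1], uniformly spaced fixed biases, trainable output weights only, and a target made of one low-frequency and one high-frequency sine. The function name `spectral_bias_demo` and all parameter values are hypothetical.

```python
import numpy as np

def spectral_bias_demo(n=200, m=50, steps=5000):
    """Run gradient descent on an assumed fixed-bias ReLU model.

    Returns the remaining residual amplitude at the low and high target
    frequencies, plus the loss history.
    """
    x = np.linspace(0.0, 1.0, n)                    # sample points in [0, 1]
    b = np.linspace(0.0, 1.0, m, endpoint=False)    # fixed biases (assumed uniform)
    phi = np.maximum(x[:, None] - b[None, :], 0.0)  # phi[i, j] = ReLU(x_i - b_j)

    low = np.sin(2 * np.pi * x)                     # low-frequency component
    high = np.sin(16 * np.pi * x)                   # high-frequency component
    y = low + high                                  # target samples

    hessian = phi.T @ phi / n                       # Hessian of the quadratic loss
    eta = 1.0 / np.linalg.eigvalsh(hessian)[-1]     # step size 1 / lambda_max

    a = np.zeros(m)                                 # trainable output weights
    losses = []
    for _ in range(steps):
        r = phi @ a - y                             # residual on the grid
        losses.append(0.5 * np.mean(r ** 2))
        a -= eta * (phi.T @ r) / n                  # gradient step
    r = phi @ a - y
    losses.append(0.5 * np.mean(r ** 2))

    # Project the final residual onto each target sine (amplitude 1 at start).
    low_left = abs(2.0 * np.mean(r * low))
    high_left = abs(2.0 * np.mean(r * high))
    return low_left, high_left, losses

if __name__ == "__main__":
    low_left, high_left, losses = spectral_bias_demo()
    print("low-frequency residual amplitude:", low_left)
    print("high-frequency residual amplitude:", high_left)
    print("loss: start", losses[0], "end", losses[-1])
```

If the spectral-bias claim holds for this toy setup, the low-frequency residual should shrink well before the high-frequency one. A run where the loss grows, or where the high frequency is fitted first, would be the kind of counter-evidence described above.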

Original abstract

We analyze a simple one-hidden-layer neural network with ReLU activation functions and fixed biases, with one-dimensional input and output. We study both continuous and discrete versions of the model, and we rigorously prove the convergence of the learning process with the $L^2$ squared loss function and the gradient descent procedure. We also prove the spectral bias property for this learning process. Several conclusions of this analysis are discussed; in particular, regarding the structure and properties that activation functions should possess, as well as the relationships between the spectrum of certain operators and the learning process. Based on this, we also propose an alternative activation function, the full-wave rectified exponential function (FReX), and we discuss the convergence of the gradient descent with this alternative activation function.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper proves convergence and spectral bias for gradient descent on a one-layer 1D fixed-bias ReLU network and introduces FReX, but the setting is so narrow that the results stay mostly formal.

The core contribution is a set of rigorous derivations showing that both continuous and discrete gradient descent converge for the L2 loss on this exact model, along with a spectral bias property tied to the integral operator spectrum. They also define FReX and sketch its convergence behavior. That is honest, scoped work on a toy model that lets them avoid many of the usual technical obstacles in neural network analysis. The discussion of activation function properties that follow from the operator spectrum is a reasonable takeaway from the math. Credit is due for staying within the stated assumptions rather than overclaiming generality. The proofs appear to rest on direct manipulation of the gradient flow equations without circular definitions or unstated regularity conditions. The stress-test note aligns with what the abstract and scope indicate: no load-bearing gaps in the 1D fixed-bias case.

The main limitation is the extreme simplification itself. One-dimensional input and output, fixed biases, and a single hidden layer make the analysis feasible but also make it unclear how much carries over once biases are trained or inputs become vectors. FReX is presented as an alternative, yet the paper offers no numerical evidence or comparison that would show whether it behaves better in practice.

This is the kind of paper that belongs in a reading group focused on theoretical dynamics of simple networks. Readers working on spectral bias or activation design in controlled settings could extract the operator analysis and the convergence arguments as building blocks. It is not broad enough to change how most practitioners think about training, but the formal grounding is sufficient to merit referee time. I would send it for peer review so the derivations can be checked in detail.

Referee Report

0 major / 3 minor

Summary. The manuscript analyzes a one-hidden-layer neural network with ReLU activations, fixed biases, and one-dimensional input/output. It rigorously proves convergence of both the continuous gradient flow and discrete gradient descent under squared L² loss, establishes the spectral bias property via the spectrum of an associated integral operator, and proposes the full-wave rectified exponential (FReX) activation function while discussing its convergence under the same training procedure.

Significance. If the derivations hold, the work supplies a concrete, fully-scoped mathematical treatment of gradient-descent dynamics and spectral bias for a deliberately simplified model. The explicit proofs for both continuous and discrete cases, together with the operator-theoretic framing of spectral bias and the analysis of a new activation, constitute a clear strength. Such results can serve as a reference point for understanding why spectral bias appears in practice and for guiding the design of activation functions, even though the setting is restricted to 1-D fixed-bias networks.

minor comments (3)

[Model definition] The model definition (early sections) introduces the network with fixed biases but does not explicitly state the precise function space in which the weights live; adding a short sentence clarifying that the weights are real scalars (or vectors in the 1-D case) would remove any ambiguity for readers.
[Discrete GD convergence] In the convergence proof for discrete gradient descent, the step-size restriction is stated in terms of a generic Lipschitz constant; an explicit upper bound derived from the network parameters would make the result more immediately usable.
[FReX proposal] The FReX activation is defined and its convergence is discussed, yet no plot or numerical comparison with ReLU on a simple target function is provided; a single illustrative figure would strengthen the claim that FReX is a viable alternative.
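The step-size comment above can be illustrated with a small sketch. It assumes a model the report does not spell out: only the output weights of a fixed-bias ReLU network are trained on the loss (1/2n)·||Φa − y||², so the gradient's Lipschitz constant is the largest eigenvalue of ΦᵀΦ/n, which is bounded by the trace and hence by m·R² whenever |x_i − b_j| ≤ R. The function names `hessian_bounds` and `explicit_step_size` are hypothetical.

```python
import numpy as np

def hessian_bounds(x, b):
    """Largest Hessian eigenvalue and its trace bound for the assumed model."""
    # phi[i, j] = ReLU(x_i - b_j); only output weights are trained (assumption).
    phi = np.maximum(x[:, None] - b[None, :], 0.0)
    hessian = phi.T @ phi / len(x)             # Hessian of (1/2n) * ||phi @ a - y||^2
    lam_max = np.linalg.eigvalsh(hessian)[-1]  # exact gradient-Lipschitz constant
    return lam_max, np.trace(hessian)          # trace >= lam_max for a PSD matrix

def explicit_step_size(m, radius):
    """Step size 1 / (m * radius^2), valid when |x_i - b_j| <= radius for all i, j."""
    return 1.0 / (m * radius ** 2)

x = np.linspace(0.0, 1.0, 100)                 # inputs in [0, 1]
b = np.linspace(0.0, 1.0, 20, endpoint=False)  # fixed biases in [0, 1)
lam_max, trace_bound = hessian_bounds(x, b)
eta = explicit_step_size(len(b), radius=1.0)
print(lam_max, trace_bound, eta * lam_max)     # eta * lam_max stays at or below 1
```

The explicit bound is conservative: it needs only the number of hidden units and the diameter of the input and bias range, at the price of a smaller step than 1/λ_max would allow.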

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, including the rigorous proofs for gradient flow and discrete gradient descent convergence, the spectral bias analysis via the integral operator, and the proposal of the FReX activation function. The recommendation for minor revision is noted. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained mathematical proofs

The paper restricts itself to a concrete 1D one-hidden-layer model with fixed biases and ReLU (or FReX). Central results are explicit proofs of convergence for continuous/discrete gradient descent under L2 loss and of spectral bias, obtained directly from the gradient-flow ODEs and the spectrum of the associated integral operator. No parameters are fitted to data and then relabeled as predictions, no self-definitional loops appear in the activation or loss definitions, and no load-bearing uniqueness theorem is imported from the authors' prior work. Any self-citations are peripheral and do not substitute for the derivations. The analysis therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on standard properties of ReLU, gradient descent convergence in suitable loss landscapes, and the definition of a new activation; no free parameters are fitted to data in the stated results.

axioms (2)

domain assumption: ReLU satisfies the standard piecewise-linear properties used in convergence arguments for gradient descent.
Invoked throughout the model definition and proof sketches in the abstract.
domain assumption: The loss landscape for the squared L2 loss on this model permits global convergence of gradient descent under the stated conditions.
Central to the convergence claim.

invented entities (1)

FReX (full-wave rectified exponential) activation function · no independent evidence
purpose: Alternative activation proposed to possess desirable structural properties identified in the analysis
Defined and analyzed in the paper; no independent empirical or theoretical validation outside this work is mentioned.

pith-pipeline@v0.9.0 · 5428 in / 1511 out tokens · 33841 ms · 2026-05-10T17:50:14.922353+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We rigorously prove the convergence of the learning process with the L2 squared loss function and the gradient descent procedure. We also prove the spectral bias property... propose... FReX... fundamental solution of 1/2(-d²/dx² +1)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero unclear

ReLU''(z)=δ(z)... FReX satisfies 1/2(-d²/dx² +1)FReX=δ
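
The quoted identity can be checked by hand under one assumption: that FReX is the bounded solution of the equation, $\mathrm{FReX}(x) = e^{-|x|}$. The closed form is not quoted on this page, so treat it as an inference from the name and the stated equation; derivatives are taken in the distributional sense.

```latex
% Assumption: u(x) = FReX(x) = e^{-|x|} (closed form inferred, not quoted from the paper).
\begin{align*}
u(x) &= e^{-|x|}, \qquad u'(x) = -\operatorname{sgn}(x)\,e^{-|x|},\\
u''(x) &= e^{-|x|} - 2\,\delta(x)
  \qquad \text{since } \operatorname{sgn}'(x) = 2\,\delta(x),\\
\tfrac12\Bigl(-\tfrac{d^2}{dx^2} + 1\Bigr)u
  &= \tfrac12\bigl(-e^{-|x|} + 2\,\delta(x) + e^{-|x|}\bigr) = \delta(x).
\end{align*}
```

Read this way, the parallel in the quoted passage is that ReLU is a fundamental solution of $d^2/dx^2$, while FReX plays the same role for $\tfrac12(-d^2/dx^2 + 1)$.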

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1] [ABL+24] Harbir Antil, Thomas S. Brown, Rainald Löhner, Fumiya Togashi, and Deepanshu Verma. Deep neural nets with fixed bias configuration. Numerical Algebra, Control and Optimization, 14(1):20–33, 2024. [CFW+21] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. In Proceedings...

[2] Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5301–5310. PMLR, 2019. [RS80] Michael Reed and Barry Sim...
