Theory of the Frequency Principle for General Deep Neural Networks

Tao Luo; Yaoyu Zhang; Zheng Ma; Zhi-Qin John Xu

arXiv: 1906.09235 · v2 · pith:6ZTQJG7Knew · submitted 2019-06-21 · cs.LG · math.OC · stat.ML

Pith reviewed 2026-05-25 18:57 UTC · model grok-4.3

classification: cs.LG · math.OC · stat.ML

keywords: Frequency Principle · Deep Neural Networks · Training Dynamics · Low to High Frequencies · General Activation Functions · Loss Functions · Multilayer Networks

The pith

Deep neural networks learn target functions from low to high frequencies at initial, intermediate, and final training stages for general multilayer setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes rigorous theorems showing that the Frequency Principle holds during DNN training. It demonstrates this pattern across three distinct stages, using for each stage a quantity that tracks frequency components. The results apply to multilayer networks, arbitrary activation functions, general data population densities, and a broad class of loss functions. A reader would care because the theorems give a foundation for analyzing why training proceeds in this ordered way rather than learning all frequencies at once.

Core claim

The authors prove three theorems, one for each training stage, that characterize the Frequency Principle in terms of suitable quantities. These theorems apply without restriction to the number of layers, the choice of activation functions, the data density, or the loss function within a large class. The work thereby shows that low-to-high frequency learning is a general feature of the dynamics rather than an artifact of specific architectures or losses.

What carries the argument

The Frequency Principle, characterized by proper quantities at each of the three training stages.
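
To make "proper quantities" concrete, the short derivation below starts from the initial-stage bound that this page quotes from the paper, $|dL^{+}_{\rho,\eta}/dt| / |dL_{\rho}/dt| \le C\,\eta^{-m}$, and shows what it implies for the low-frequency part. It is a sketch under one assumption that is not stated on this page: that the loss splits additively at a cutoff frequency $\eta$ into a low-frequency part, written here as $L^{-}_{\rho,\eta}$ (hypothetical notation), and the high-frequency part $L^{+}_{\rho,\eta}$. The constants $C$ and $m$ are those of the quoted bound.

```latex
% Assumption (not stated on this page): the loss splits at cutoff \eta as
%   L_\rho = L^{-}_{\rho,\eta} + L^{+}_{\rho,\eta}.
\begin{align*}
\left|\frac{dL^{+}_{\rho,\eta}}{dt}\right|
  &\le C\,\eta^{-m}\left|\frac{dL_{\rho}}{dt}\right|
  && \text{(bound quoted from the paper)} \\
\left|\frac{dL^{-}_{\rho,\eta}}{dt}\right|
  &= \left|\frac{dL_{\rho}}{dt} - \frac{dL^{+}_{\rho,\eta}}{dt}\right|
   \ge \left|\frac{dL_{\rho}}{dt}\right| - \left|\frac{dL^{+}_{\rho,\eta}}{dt}\right|
   \ge \left(1 - C\,\eta^{-m}\right)\left|\frac{dL_{\rho}}{dt}\right|.
\end{align*}
```

Under that assumption, once $\eta$ is large enough that $C\,\eta^{-m} < 1$, most of the change in the loss comes from frequencies below the cutoff, which is the low-frequency-first behavior in quantitative form.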

If this is right

Training dynamics can be tracked by monitoring frequency content separately in the initial, intermediate, and final phases.
The low-to-high ordering persists even when activation functions and loss functions are chosen freely within the allowed class.
Analysis of generalization or convergence can be performed by examining how each stage contributes to the overall frequency spectrum learned.
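
Monitoring frequency content can be illustrated with a small NumPy sketch. It is not the paper's definition of its stage-wise quantities: it assumes a one-dimensional residual sampled on a uniform periodic grid and a squared-error measure, and the function name `frequency_split` and the cutoff value are hypothetical choices for illustration.

```python
import numpy as np

def frequency_split(residual, cutoff):
    """Split the mean-squared residual on a uniform periodic grid into
    low-frequency (|k| <= cutoff) and high-frequency (|k| > cutoff) parts.

    Illustrative helper only; not the quantity defined in the paper.
    """
    n = residual.shape[0]
    coeffs = np.fft.fft(residual) / n          # normalized DFT coefficients
    k = np.abs(np.fft.fftfreq(n, d=1.0 / n))   # integer frequency index of each coefficient
    power = np.abs(coeffs) ** 2                # Parseval: power.sum() == mean(residual**2)
    low = power[k <= cutoff].sum()
    high = power[k > cutoff].sum()
    return low, high

# Hypothetical residual with one low (k=1) and one high (k=5) frequency.
x = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
residual = np.sin(x) + 0.5 * np.sin(5.0 * x)
low, high = frequency_split(residual, cutoff=2)
print(low, high)  # approximately 0.5 and 0.125
```

Because the two parts sum to the mean-squared residual, logging `low` and `high` during training gives one simple way to see which band the error currently sits in.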

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Design choices that accelerate the initial stage might shorten overall training time without harming the frequency ordering.
The same frequency tracking could be applied to compare different optimizers by how cleanly they separate the three stages.
If the quantities used to characterize each stage can be computed from data alone, they might serve as early-stopping signals.

Load-bearing premise

The premise is that suitable quantities exist to characterize the Frequency Principle at each of the three stages, and that the stated generality over activations, data densities, and losses is enough for the theorems to hold.

What would settle it

A concrete counter-example would be any multilayer network with a standard activation and loss function whose training trajectory shows high-frequency components dominating before low-frequency ones at one of the three stages.
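
One way to look for such a counter-example empirically is sketched below. It is an illustration, not the paper's experimental setup: the target function, the two-layer tanh network, its width, the learning rate, and the step count are all hypothetical choices, and training is plain full-batch gradient descent on a mean-squared error over grid samples. The script records the relative DFT error at a low frequency (k=1) and a high frequency (k=5) as training proceeds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D target with one low (k=1) and one high (k=5) frequency.
n = 128
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
target = np.sin(x) + np.sin(5.0 * x)

# Two-layer tanh network h(x) = sum_j a_j * tanh(w_j * x + b_j).
width = 100
w = rng.normal(size=width)
b = rng.normal(size=width)
a = rng.normal(size=width) / np.sqrt(width)

def relative_errors(pred):
    """Relative DFT error of the prediction at frequencies k=1 and k=5."""
    ks = np.array([1, 5])
    pred_hat = np.fft.rfft(pred)[ks]
    target_hat = np.fft.rfft(target)[ks]
    return np.abs(pred_hat - target_hat) / np.abs(target_hat)

lr = 0.005          # illustrative step size, chosen small for stability
history = []        # entries: (step, error at k=1, error at k=5)
for step in range(2001):
    hidden = np.tanh(np.outer(x, w) + b)                  # shape (n, width)
    pred = hidden @ a
    if step % 200 == 0:
        history.append((step, *relative_errors(pred)))
    resid = pred - target                                 # loss is 0.5 * mean(resid**2)
    grad_a = hidden.T @ resid / n
    dhidden = (resid[:, None] * a) * (1.0 - hidden ** 2)  # backprop through tanh
    grad_w = (dhidden * x[:, None]).sum(axis=0) / n
    grad_b = dhidden.sum(axis=0) / n
    a -= lr * grad_a
    w -= lr * grad_w
    b -= lr * grad_b

for step, err_low, err_high in history:
    print(f"step {step:5d}  k=1 error {err_low:.3f}  k=5 error {err_high:.3f}")
```

A run in which the k=5 error fell well before the k=1 error would be the kind of trajectory described above. A single small experiment like this can only suggest a pattern; it neither proves nor refutes the theorems.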

Figures

Figures reproduced from arXiv: 1906.09235 by Tao Luo, Yaoyu Zhang, Zheng Ma, Zhi-Qin John Xu.

**Figure 1.** Numerical understanding of theorems of MSE training loss.

**Figure 2.** Numerical understanding of theorems of $L^4$ training loss. The illustrations are same as … We obtain similar results.

Original abstract

Along with fruitful applications of Deep Neural Networks (DNNs) to realistic problems, recently, some empirical studies of DNNs reported a universal phenomenon of Frequency Principle (F-Principle): a DNN tends to learn a target function from low to high frequencies during the training. The F-Principle has been very useful in providing both qualitative and quantitative understandings of DNNs. In this paper, we rigorously investigate the F-Principle for the training dynamics of a general DNN at three stages: initial stage, intermediate stage, and final stage. For each stage, a theorem is provided in terms of proper quantities characterizing the F-Principle. Our results are general in the sense that they work for multilayer networks with general activation functions, population densities of data, and a large class of loss functions. Our work lays a theoretical foundation of the F-Principle for a better understanding of the training process of DNNs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper states theorems for the F-Principle at three training stages for general DNNs, but the claimed scope over arbitrary activations and losses rests on unverified existence of the characterizing quantities.

The core contribution is three theorems that track frequency bias in DNN training from start through convergence, stated for multilayer nets, general activations, arbitrary data densities, and a wide loss class. This moves the F-Principle from repeated empirical reports to explicit statements about initial, intermediate, and final regimes, which is the main advance. The setup is clean in separating the stages and tying each to a suitable frequency-domain quantity. That structure is useful for anyone who wants a predictive handle on why low frequencies appear first.

The citation pattern is straightforward and points back to the empirical papers that first noticed the effect, without circular self-reference. No machine-checked proofs or released code are mentioned, so the results stand or fall on the written derivations.

The soft spot is exactly the one flagged in the stress-test note. The theorems require that the frequency-characterizing quantities exist and behave as described at every stage. The abstract supplies no extra hypotheses that would guarantee this when the activation is only Lipschitz (ReLU) or the loss is non-differentiable. If those quantities can fail to be finite or well-defined under the stated generality, the theorems cover a smaller set of networks and losses than advertised. A reader who needs the full scope will have to check the actual definitions and proof steps.

This paper is for theorists who work on training dynamics and want a formal starting point for the frequency bias. It is coherent on its own terms and engages the existing literature, so it is worth a serious referee even if the generality needs tightening. I would send it to review.

Referee Report

1 major / 0 minor

Summary. The paper claims to rigorously investigate the Frequency Principle (F-Principle) for the training dynamics of general deep neural networks at three stages (initial, intermediate, and final), providing a theorem for each stage in terms of quantities that characterize the principle. It asserts generality to multilayer networks with general activation functions, arbitrary population densities of data, and a large class of loss functions, thereby laying a theoretical foundation for understanding DNN training.

Significance. If the theorems hold under the stated generality, the work would provide a formal basis for the empirically observed low-to-high frequency learning behavior, enabling quantitative analysis of training dynamics across diverse architectures, activations, and losses. This could inform optimization strategies and generalization studies in deep learning.

major comments (1)

[Abstract] The central claim of applicability to general activations (including non-smooth ones such as ReLU) and a large class of losses (potentially non-convex or non-differentiable) is load-bearing, yet the abstract provides no explicit hypotheses ensuring that the characterizing quantities for the F-Principle exist, are finite, and behave as required at each of the three stages. Without such conditions, the theorems' scope does not automatically follow from the asserted generality.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below.

Referee: [Abstract] The central claim of applicability to general activations (including non-smooth ones such as ReLU) and a large class of losses (potentially non-convex or non-differentiable) is load-bearing, yet the abstract provides no explicit hypotheses ensuring that the characterizing quantities for the F-Principle exist, are finite, and behave as required at each of the three stages. Without such conditions, the theorems' scope does not automatically follow from the asserted generality.

Authors: We agree that the abstract does not explicitly state the hypotheses under which the characterizing quantities exist, remain finite, and satisfy the required behavior at each stage. The theorems themselves are proved under assumptions that guarantee these conditions, but the abstract's generality claim would be clearer with a brief reference to them. We will revise the abstract to include a concise statement of the key hypotheses.

Revision: yes

Circularity Check

0 steps flagged

No circularity: theorems derive F-Principle characterizations from stated assumptions without reduction to inputs or self-citations

The paper states three theorems characterizing the F-Principle at initial, intermediate, and final training stages using quantities defined for general activations, data densities, and losses. No quoted step reduces a claimed prediction or result to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The derivations are presented as independent mathematical statements rather than renamings or ansatzes imported from prior author work. This matches the default expectation for a theoretical paper whose central claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of suitable frequency-characterizing quantities and on the stated generality over activations, data, and losses; no free parameters or invented entities are indicated in the abstract.

axioms (2)

Domain assumption: Suitable quantities exist that characterize the Frequency Principle at initial, intermediate, and final training stages. Basis: the abstract states that theorems are given in terms of such quantities.
Domain assumption: The results hold for general activation functions, arbitrary population densities, and a large class of loss functions. Basis: explicitly claimed as the generality of the theorems.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "For each stage, a theorem is provided in terms of proper quantities characterizing the F-Principle... $|dL^{+}_{\rho,\eta}/dt| \,/\, |dL_{\rho}/dt| \le C\,\eta^{-m}$"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "$\langle\cdot\rangle^{m}\,|\nabla_{\theta} q_{\rho}(\cdot,\theta)| \in L^{1}(\mathbb{R}^{d})$ ... from $W^{k,2}$ regularity of $h$ and $f$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Neural Spectral Bias and Conformal Correlators I: Introduction and Applications
hep-th · 2026-04 · unverdicted · novelty 8.0

Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.

A Greedy PDE Router for Blending Neural Operators and Classical Methods
stat.ME · 2025-09 · unverdicted · novelty 6.0

An approximate greedy router for hybrid PDE solvers that mimics optimal selection without true error access and shows faster, more stable error reduction on test equations.

