Theory of the Frequency Principle for General Deep Neural Networks
Pith reviewed 2026-05-25 18:57 UTC · model grok-4.3
The pith
Deep neural networks learn target functions from low to high frequencies at initial, intermediate, and final training stages for general multilayer setups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors prove three theorems, one for each training stage, that characterize the Frequency Principle in terms of suitable quantities. These theorems apply without restriction to the number of layers, the choice of activation functions, the data density, or the loss function within a large class. The work thereby shows that low-to-high frequency learning is a general feature of the dynamics rather than an artifact of specific architectures or losses.
What carries the argument
The Frequency Principle, characterized by proper quantities at each of the three training stages.
If this is right
- Training dynamics can be tracked by monitoring frequency content separately in the initial, intermediate, and final phases.
- The low-to-high ordering persists even when activation functions and loss functions are chosen freely within the allowed class.
- Analysis of generalization or convergence can be performed by examining how each stage contributes to the overall frequency spectrum learned.
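The kind of frequency tracking described in the list above can be sketched as follows. This is a minimal illustration, not the paper's method: the relative DFT error used here is an assumed empirical proxy rather than the quantities in the three theorems, and the target function, the tracked frequency indices (1 and 5), the two-layer tanh network, the learning rate, and the names `relative_error_by_frequency` and `train_and_track` are all hypothetical choices made for the example.

```python
import numpy as np


def relative_error_by_frequency(target, prediction, freq_indices):
    """Relative DFT error |F[target] - F[prediction]| / |F[target]| at chosen indices."""
    target_hat = np.fft.rfft(target)
    prediction_hat = np.fft.rfft(prediction)
    idx = np.asarray(freq_indices)
    return np.abs(target_hat[idx] - prediction_hat[idx]) / np.abs(target_hat[idx])


def train_and_track(steps=2000, log_every=500, hidden=64, lr=0.01, seed=0):
    """Full-batch gradient descent on a two-layer tanh network (illustrative only).

    Returns a list of (step, errors) pairs, where errors holds the relative
    DFT error at frequency indices 1 and 5 of the assumed target.
    """
    rng = np.random.default_rng(seed)
    n = 128
    x = np.arange(n) / n  # uniform grid on [0, 1)
    # Assumed target: one low-frequency and one high-frequency component.
    y = np.sin(2 * np.pi * x) + 0.5 * np.sin(2 * np.pi * 5 * x)
    X = x[:, None]

    w1 = rng.normal(0.0, 1.0, size=(1, hidden))
    b1 = rng.normal(0.0, 1.0, size=hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))
    b2 = 0.0

    history = []
    for step in range(steps + 1):
        h = np.tanh(X @ w1 + b1)        # hidden activations, shape (n, hidden)
        pred = (h @ w2)[:, 0] + b2      # network output, shape (n,)
        if step % log_every == 0:
            history.append((step, relative_error_by_frequency(y, pred, [1, 5])))

        # Gradients of the loss 0.5 * mean((pred - y) ** 2).
        r = (pred - y)[:, None] / n     # dLoss/dpred, shape (n, 1)
        grad_w2 = h.T @ r
        grad_b2 = r.sum()
        dh = (r @ w2.T) * (1.0 - h ** 2)  # backprop through tanh
        grad_w1 = X.T @ dh
        grad_b1 = dh.sum(axis=0)

        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return history


if __name__ == "__main__":
    for step, errors in train_and_track():
        print(step, errors)
```

Each logged row gives the relative error at the low and the high frequency. If the low-to-high ordering holds for a given run, the first entry should shrink earlier than the second, but nothing in this sketch guarantees that outcome for other settings.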
Where Pith is reading between the lines
- Design choices that accelerate the initial stage might shorten overall training time without harming the frequency ordering.
- The same frequency tracking could be applied to compare different optimizers by how cleanly they separate the three stages.
- If the quantities used to characterize each stage can be computed from data alone, they might serve as early-stopping signals.
Load-bearing premise
The premise is twofold: that suitable quantities exist to characterize the Frequency Principle at each of the three stages, and that the stated generality over activations, data densities, and losses is enough for the theorems to hold.
What would settle it
A concrete counter-example would be any multilayer network with a standard activation and loss function whose training trajectory shows high-frequency components dominating before low-frequency ones at one of the three stages.
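One way to look for such a counter-example in logged training data is sketched below. It assumes a matrix of per-frequency relative errors (rows are logged steps, columns are frequencies in increasing order) and treats "first step below a threshold" as the convergence time. Both the threshold and this convergence criterion are assumptions made for illustration, not the quantities used in the paper's theorems, and the function names are hypothetical.

```python
import numpy as np


def first_step_below(errors, threshold):
    """Per frequency (column), index of the first logged step with error < threshold.

    Returns -1 for frequencies that never drop below the threshold.
    """
    errors = np.asarray(errors, dtype=float)
    below = errors < threshold
    first = np.argmax(below, axis=0)        # first True per column (0 if none)
    first[~below.any(axis=0)] = -1          # mark columns that never converge
    return first


def violates_low_to_high(errors, threshold=0.1):
    """True if some higher frequency converges strictly before the next lower one.

    Columns of `errors` must be ordered from low to high frequency.
    """
    first = first_step_below(errors, threshold)
    # Frequencies that never converge are treated as converging at infinity.
    times = np.where(first < 0, np.inf, first.astype(float))
    return bool(np.any(times[1:] < times[:-1]))


if __name__ == "__main__":
    # Synthetic log: low frequency converges at step 1, high frequency at step 2.
    consistent = [[1.0, 1.0], [0.05, 0.8], [0.01, 0.05]]
    # Synthetic log: high frequency converges first.
    suspicious = [[1.0, 1.0], [0.9, 0.05], [0.05, 0.01]]
    print(violates_low_to_high(consistent))
    print(violates_low_to_high(suspicious))
```

A `True` result on real training logs would only flag a candidate; whether it contradicts the theorems depends on whether the run satisfies their hypotheses and on how the stage-specific quantities are actually defined.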
Original abstract
Along with fruitful applications of Deep Neural Networks (DNNs) to realistic problems, recently, some empirical studies of DNNs reported a universal phenomenon of Frequency Principle (F-Principle): a DNN tends to learn a target function from low to high frequencies during the training. The F-Principle has been very useful in providing both qualitative and quantitative understandings of DNNs. In this paper, we rigorously investigate the F-Principle for the training dynamics of a general DNN at three stages: initial stage, intermediate stage, and final stage. For each stage, a theorem is provided in terms of proper quantities characterizing the F-Principle. Our results are general in the sense that they work for multilayer networks with general activation functions, population densities of data, and a large class of loss functions. Our work lays a theoretical foundation of the F-Principle for a better understanding of the training process of DNNs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to rigorously investigate the Frequency Principle (F-Principle) for the training dynamics of general deep neural networks at three stages (initial, intermediate, and final), providing a theorem for each stage in terms of quantities that characterize the principle. It asserts generality to multilayer networks with general activation functions, arbitrary population densities of data, and a large class of loss functions, thereby laying a theoretical foundation for understanding DNN training.
Significance. If the theorems hold under the stated generality, the work would provide a formal basis for the empirically observed low-to-high frequency learning behavior, enabling quantitative analysis of training dynamics across diverse architectures, activations, and losses. This could inform optimization strategies and generalization studies in deep learning.
Major comments (1)
- [Abstract] The central claim of applicability to general activations (including non-smooth ones such as ReLU) and a large class of losses (potentially non-convex or non-differentiable) is load-bearing, yet the abstract provides no explicit hypotheses ensuring that the characterizing quantities for the F-Principle exist, are finite, and behave as required at each of the three stages. Without such conditions, the theorems' scope does not automatically follow from the asserted generality.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the single major comment below.
- Referee: [Abstract] The central claim of applicability to general activations (including non-smooth ones such as ReLU) and a large class of losses (potentially non-convex or non-differentiable) is load-bearing, yet the abstract provides no explicit hypotheses ensuring that the characterizing quantities for the F-Principle exist, are finite, and behave as required at each of the three stages. Without such conditions, the theorems' scope does not automatically follow from the asserted generality.
  Authors: We agree that the abstract does not explicitly state the hypotheses under which the characterizing quantities exist, remain finite, and satisfy the required behavior at each stage. The theorems themselves are proved under assumptions that guarantee these conditions, but the abstract's generality claim would be clearer with a brief reference to them. We will revise the abstract to include a concise statement of the key hypotheses.
  Revision: yes
Circularity Check
No circularity: theorems derive F-Principle characterizations from stated assumptions without reduction to inputs or self-citations
The paper states three theorems characterizing the F-Principle at initial, intermediate, and final training stages using quantities defined for general activations, data densities, and losses. No quoted step reduces a claimed prediction or result to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The derivations are presented as independent mathematical statements rather than renamings or ansatzes imported from prior author work. This matches the default expectation for a theoretical paper whose central claims remain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Suitable quantities exist that characterize the Frequency Principle at initial, intermediate, and final training stages.
- Domain assumption: The results hold for general activation functions, arbitrary population densities, and a large class of loss functions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  For each stage, a theorem is provided in terms of proper quantities characterizing the F-Principle... $\left|dL^{+}_{\rho,\eta}/dt\right| \,/\, \left|dL_{\rho}/dt\right| \le C\,\eta^{-m}$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  $\langle\cdot\rangle^{m}\,|\nabla_{\theta} q_{\rho}(\cdot,\theta)| \in L^{1}(\mathbb{R}^{d})$ ... from $W^{k,2}$ regularity of $h$ and $f$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Neural Spectral Bias and Conformal Correlators I: Introduction and Applications
  Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.
- A Greedy PDE Router for Blending Neural Operators and Classical Methods
  An approximate greedy router for hybrid PDE solvers that mimics optimal selection without true error access and shows faster, more stable error reduction on test equations.