
arXiv: 1906.09235 · v2 · pith:6ZTQJG7Knew · submitted 2019-06-21 · cs.LG · math.OC · stat.ML

Theory of the Frequency Principle for General Deep Neural Networks

Pith reviewed 2026-05-25 18:57 UTC · model grok-4.3

classification: cs.LG · math.OC · stat.ML
keywords: Frequency Principle · Deep Neural Networks · Training Dynamics · Low to High Frequencies · General Activation Functions · Loss Functions · Multilayer Networks

The pith

Deep neural networks learn target functions from low to high frequencies at initial, intermediate, and final training stages for general multilayer setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper states rigorous theorems showing that the Frequency Principle holds during DNN training. It establishes the pattern at three distinct stages, using quantities that track frequency components. The results apply to multilayer networks, general activation functions, general data population densities, and a broad class of loss functions. A reader would care because the theorems give a foundation for analyzing why training fits low frequencies before high ones rather than fitting all frequencies at once.

Core claim

The authors prove three theorems, one for each training stage, that characterize the Frequency Principle in terms of suitable quantities. The theorems place no restriction on the number of layers and cover general activation functions, general data densities, and a large class of loss functions. The work thereby shows that low-to-high frequency learning is a general feature of the dynamics rather than an artifact of specific architectures or losses.

What carries the argument

The Frequency Principle, characterized by proper quantities at each of the three training stages.

If this is right

  • Training dynamics can be tracked by monitoring frequency content separately in the initial, intermediate, and final phases.
  • The low-to-high ordering persists even when activation functions and loss functions are chosen freely within the allowed class.
  • Analysis of generalization or convergence can be performed by examining how each stage contributes to the overall frequency spectrum learned.
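One way to make "monitoring frequency content" concrete is a per-frequency relative error between the discrete Fourier transforms of the target and of the network output. The sketch below is an illustration only: the function name `relative_frequency_error`, the 1-D target, and the stand-in "partially trained" prediction are all assumptions, and this quantity is not claimed to be the one used in the paper's theorems.

```python
import numpy as np

def relative_frequency_error(target, prediction, freq_indices):
    """Relative DFT error |P[k] - T[k]| / |T[k]| at the selected frequency indices.

    Only pick indices where the target has non-zero amplitude,
    otherwise the ratio is undefined.
    """
    target_hat = np.fft.rfft(target)
    pred_hat = np.fft.rfft(prediction)
    k = np.asarray(freq_indices)
    return np.abs(pred_hat[k] - target_hat[k]) / np.abs(target_hat[k])

# Hypothetical 1-D target on a uniform grid: one low and one high frequency.
n = 256
x = np.arange(n) / n
target = np.sin(2 * np.pi * 1 * x) + 0.5 * np.sin(2 * np.pi * 10 * x)

# Stand-in for a network output that has fit only the low-frequency component.
partial_fit = np.sin(2 * np.pi * 1 * x)

# Error at frequency index 1 is near zero; error at index 10 is near one.
errors = relative_frequency_error(target, partial_fit, [1, 10])
print(errors)
```

In an actual experiment, `partial_fit` would be replaced by the network output evaluated on the same grid at each recorded training step, giving one row of errors per step.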

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Design choices that accelerate the initial stage might shorten overall training time without harming the frequency ordering.
  • The same frequency tracking could be applied to compare different optimizers by how cleanly they separate the three stages.
  • If the quantities used to characterize each stage can be computed from data alone, they might serve as early-stopping signals.

Load-bearing premise

That suitable quantities exist to characterize the Frequency Principle at each of the three stages, and that the stated generality over activations, data densities, and losses is enough for the theorems to hold.

What would settle it

A concrete counter-example would be any multilayer network with a standard activation and loss function whose training trajectory shows high-frequency components dominating before low-frequency ones at one of the three stages.
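A numerical search for such a counter-example could be sketched as follows, assuming a recorded history of per-frequency errors (rows are training steps, columns are frequencies ordered from low to high). The helper names, the tolerance `tol`, and the toy histories are hypothetical. A flagged violation in an experiment would only be a hint, since the theorems are stated for the paper's own quantities and assumptions, which this check does not reproduce.

```python
import numpy as np

def first_step_below(error_history, tol):
    """For each frequency (column), first step (row) with error < tol, or -1 if never."""
    below = np.asarray(error_history) < tol
    first = below.argmax(axis=0)           # index of first True per column (0 if none)
    first[~below.any(axis=0)] = -1         # mark columns that never converge
    return first

def violates_low_to_high(error_history, tol=0.1):
    """True if some higher frequency converges strictly before a lower one.

    Columns must be ordered from lowest to highest frequency.
    """
    steps = first_step_below(error_history, tol).astype(float)
    steps[steps < 0] = np.inf              # never converged counts as latest
    # A decrease between neighbours means a higher frequency converged first.
    return bool(np.any(steps[1:] < steps[:-1]))

# Toy history (assumed values): low frequency converges first, then mid, then high.
history_ok = np.array([
    [1.00, 1.00, 1.00],
    [0.05, 0.80, 1.00],
    [0.01, 0.05, 0.90],
    [0.00, 0.01, 0.05],
])
history_bad = history_ok[:, ::-1]          # reversed order: high frequency first

print(violates_low_to_high(history_ok), violates_low_to_high(history_bad))
```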

Figures

Figures reproduced from arXiv: 1906.09235 by Tao Luo, Yaoyu Zhang, Zheng Ma, Zhi-Qin John Xu.

Figure 2: We obtain similar results. 4 Proof of Theorems 4.1 F-Principle: Initial Stage (Theorem 1) In this section, we focus on the initial stage of the training dynamics. The first result shows that the change of loss function concentrates on low frequencies. In general, C may depend on T. In the next section, we will provide a similar result in some situation where C does not depend on T. Proof of Theorem 1 (L 2…
Figure 1: Numerical understanding of theorems of MSE training loss. (a) [PITH_FULL_IMAGE:figures/full_fig_p012_1.png]
Figure 2: Numerical understanding of theorems of L^4 training loss. The illustrations are the same as [PITH_FULL_IMAGE:figures/full_fig_p013_2.png]
Original abstract

Along with fruitful applications of Deep Neural Networks (DNNs) to realistic problems, recently, some empirical studies of DNNs reported a universal phenomenon of Frequency Principle (F-Principle): a DNN tends to learn a target function from low to high frequencies during the training. The F-Principle has been very useful in providing both qualitative and quantitative understandings of DNNs. In this paper, we rigorously investigate the F-Principle for the training dynamics of a general DNN at three stages: initial stage, intermediate stage, and final stage. For each stage, a theorem is provided in terms of proper quantities characterizing the F-Principle. Our results are general in the sense that they work for multilayer networks with general activation functions, population densities of data, and a large class of loss functions. Our work lays a theoretical foundation of the F-Principle for a better understanding of the training process of DNNs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to rigorously investigate the Frequency Principle (F-Principle) for the training dynamics of general deep neural networks at three stages (initial, intermediate, and final), providing a theorem for each stage in terms of quantities that characterize the principle. It asserts generality to multilayer networks with general activation functions, arbitrary population densities of data, and a large class of loss functions, thereby laying a theoretical foundation for understanding DNN training.

Significance. If the theorems hold under the stated generality, the work would provide a formal basis for the empirically observed low-to-high frequency learning behavior, enabling quantitative analysis of training dynamics across diverse architectures, activations, and losses. This could inform optimization strategies and generalization studies in deep learning.

major comments (1)
  1. [Abstract] Abstract: The central claim of applicability to general activations (including non-smooth ones such as ReLU) and a large class of losses (potentially non-convex or non-differentiable) is load-bearing, yet the abstract provides no explicit hypotheses ensuring that the characterizing quantities for the F-Principle exist, are finite, and behave as required at each of the three stages. Without such conditions, the theorems' scope does not automatically follow from the asserted generality.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of applicability to general activations (including non-smooth ones such as ReLU) and a large class of losses (potentially non-convex or non-differentiable) is load-bearing, yet the abstract provides no explicit hypotheses ensuring that the characterizing quantities for the F-Principle exist, are finite, and behave as required at each of the three stages. Without such conditions, the theorems' scope does not automatically follow from the asserted generality.

    Authors: We agree that the abstract does not explicitly state the hypotheses under which the characterizing quantities exist, remain finite, and satisfy the required behavior at each stage. The theorems themselves are proved under assumptions that guarantee these conditions, but the abstract's generality claim would be clearer with a brief reference to them. We will revise the abstract to include a concise statement of the key hypotheses.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: theorems derive F-Principle characterizations from stated assumptions without reduction to inputs or self-citations

Full rationale

The paper states three theorems characterizing the F-Principle at initial, intermediate, and final training stages using quantities defined for general activations, data densities, and losses. No quoted step reduces a claimed prediction or result to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The derivations are presented as independent mathematical statements rather than renamings or ansatzes imported from prior author work. This matches the default expectation for a theoretical paper whose central claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of suitable frequency-characterizing quantities and on the stated generality over activations, data, and losses; no free parameters or invented entities are indicated in the abstract.

axioms (2)
  • domain assumption: Suitable quantities exist that characterize the Frequency Principle at initial, intermediate, and final training stages.
    Abstract states that theorems are given in terms of such quantities.
  • domain assumption: The results hold for general activation functions, arbitrary population densities, and a large class of loss functions.
    Explicitly claimed as the generality of the theorems.

pith-pipeline@v0.9.0 · 5689 in / 1127 out tokens · 46456 ms · 2026-05-25T18:57:38.566292+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Neural Spectral Bias and Conformal Correlators I: Introduction and Applications

    hep-th 2026-04 unverdicted novelty 8.0

    Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.

  2. A Greedy PDE Router for Blending Neural Operators and Classical Methods

    stat.ME 2025-09 unverdicted novelty 6.0

    An approximate greedy router for hybrid PDE solvers that mimics optimal selection without true error access and shows faster, more stable error reduction on test equations.
