A Fourier perspective on the learning dynamics of neural networks: from sample complexities to mechanistic insights
Pith reviewed 2026-05-19 19:36 UTC · model grok-4.3
The pith
Online SGD cannot learn phase-only classification on isotropic high-dimensional inputs in fewer than order N³ steps, but power-law spectra accelerate it substantially.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online SGD cannot distinguish the structured inputs from noise within n ≪ N³ steps, but needs at least n ≫ N³ log² N steps. Power-law spectra can dramatically accelerate the learning of phase information, even if the spectra do not help with classification itself.
What carries the argument
A synthetic data model for translation-invariant inputs that separates control of amplitudes and phases while preserving tractability for SGD analysis.
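To make "separate control of amplitudes and phases" concrete, here is a minimal sketch of one way to draw a real 1D signal with a prescribed amplitude spectrum and independent random phases. It is an illustration under stated assumptions, not the paper's actual data model: the function names `sample_signal` and `power_law_amplitudes`, the exponent `alpha`, and the choice of i.i.d. uniform phases are all hypothetical.

```python
import numpy as np

def power_law_amplitudes(n, alpha):
    """Amplitudes for the n//2 + 1 rFFT bins of a length-n signal (n even).

    The power spectrum decays as k**(-alpha); alpha = 0 gives a flat spectrum.
    The DC amplitude is set to 1 by convention (an arbitrary choice).
    """
    k = np.arange(n // 2 + 1, dtype=float)
    amp = np.ones_like(k)
    amp[1:] = k[1:] ** (-alpha / 2.0)
    return amp

def sample_signal(amplitudes, rng):
    """Draw one real signal whose rFFT magnitudes equal `amplitudes`."""
    n_bins = len(amplitudes)
    n = 2 * (n_bins - 1)
    # Assumption: phases are i.i.d. uniform on [0, 2*pi).
    phases = rng.uniform(0.0, 2.0 * np.pi, size=n_bins)
    # DC and Nyquist bins must be real for a real-valued signal.
    phases[0] = 0.0
    phases[-1] = 0.0
    spectrum = amplitudes * np.exp(1j * phases)
    return np.fft.irfft(spectrum, n=n)

rng = np.random.default_rng(0)
amps = power_law_amplitudes(64, alpha=1.0)
x = sample_signal(amps, rng)
```

A circular shift of `x` changes only the phases, not the magnitudes, which is the sense in which the amplitude spectrum is a translation-invariant statistic.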
If this is right
- Networks trained on images first rely on amplitude information before exploiting phase information.
- Power-law spectra accelerate phase learning even without improving final accuracy.
- The same amplitude-before-phase progression appears in deep convolutional networks on CIFAR100 and ImageNet.
- This amplitude-phase interaction explains how networks learn natural image distributions efficiently.
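One common way to probe which kind of information a trained network relies on is to evaluate it on images where either the phases or the amplitudes have been destroyed. The sketch below shows both manipulations for a single greyscale image; the helper names `phase_scramble` and `flatten_amplitudes` are hypothetical, and this is not necessarily the procedure used in the paper.

```python
import numpy as np

def phase_scramble(image, rng):
    """Keep the Fourier amplitudes of `image` and replace its phases."""
    spectrum = np.fft.fft2(image)
    # Phases taken from real white noise have the Hermitian symmetry
    # needed for the inverse transform to be real up to rounding error.
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    scrambled = np.abs(spectrum) * np.exp(1j * noise_phase)
    return np.real(np.fft.ifft2(scrambled))

def flatten_amplitudes(image, eps=1e-12):
    """Keep the Fourier phases of `image` and set every amplitude to one."""
    spectrum = np.fft.fft2(image)
    unit = spectrum / np.maximum(np.abs(spectrum), eps)
    return np.real(np.fft.ifft2(unit))

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))      # stand-in for a real image
amplitude_only = phase_scramble(image, rng)
phase_only = flatten_amplitudes(image)
```

If a network's accuracy early in training survives `phase_scramble` but not `flatten_amplitudes`, that would be consistent with the amplitude-before-phase progression listed above.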
Where Pith is reading between the lines
- The hardness result may extend to other high-dimensional data with flat spectra.
- Power-law acceleration could be tested on regression tasks or different architectures.
- The model offers a way to study how translation invariance interacts with spectral properties during training.
Load-bearing premise
The synthetic data model for translation-invariant inputs captures the real interaction between amplitudes, phases, and SGD dynamics without artifacts that would change the hardness or acceleration results.
What would settle it
An experiment showing that online SGD succeeds at phase-only classification on high-dimensional isotropic inputs in substantially fewer than N³ steps would disprove the hardness claim.
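To get a feel for the scales involved in such an experiment, the short sketch below evaluates N³ and N³ log² N for a few input dimensions. It assumes the natural logarithm and drops all constants, neither of which is specified in the summary above, so the numbers only indicate orders of magnitude.

```python
import math

def sgd_step_scales(n_dim):
    """Return (N^3, N^3 * log(N)^2), ignoring constants; natural log assumed."""
    lower = n_dim ** 3
    upper = n_dim ** 3 * math.log(n_dim) ** 2
    return lower, upper

for n_dim in (64, 256, 1024):
    lower, upper = sgd_step_scales(n_dim)
    print(f"N={n_dim:5d}  N^3={lower:.2e}  N^3 log^2 N={upper:.2e}")
```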
Original abstract
Neural networks trained with gradient-based methods exhibit a strong simplicity bias: they learn simpler statistical features of their data before moving to more complex features. Previous analyses of this phenomenon have largely focused on settings with (quasi-)isotropic inputs. In this work, we study the simplicity bias from a Fourier perspective, which allows us to include two key features of natural images in the analysis: approximate translation-invariance and power-law spectra. We first show experimentally that simple neural networks trained on image classification tasks first rely on amplitude information -- related to pair-wise correlations between pixels -- before exploiting phase information, which encodes edges and higher-order correlations. In view of this, we introduce a synthetic data model for translation-invariant inputs that allows precise control over amplitudes and phases while remaining tractable. We rigorously establish that for isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online stochastic gradient descent (SGD) cannot distinguish the structured inputs from noise within $n \ll N^3$ steps, but needs at least $n \gg N^3 \log^2{N}$ steps. In contrast, we show both experimentally and theoretically that power-law spectra can dramatically accelerate the speed of learning phase information, even if the spectra do not help with classification. Simulations with two-layer networks trained on textures and with deep convolutional networks on ImageNet and CIFAR100 confirm this non-trivial interaction between amplitudes and phases, providing mechanistic insights into how deep neural networks can learn natural image distributions efficiently.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that neural networks exhibit a simplicity bias by learning amplitude information (pairwise pixel correlations) before phase information (edges and higher-order correlations) when trained on image classification. From a Fourier perspective incorporating translation invariance and power-law spectra, the authors introduce a synthetic data model for translation-invariant inputs. They rigorously prove that for isotropic high-dimensional inputs, online SGD cannot learn phase-only classification within n ≪ N³ steps and requires at least n ≫ N³ log²N steps. They further show both theoretically and experimentally that power-law spectra accelerate phase learning even when spectra do not aid classification directly. Experiments with two-layer networks on textures and deep CNNs on ImageNet/CIFAR100 support the amplitude-to-phase transition and the non-trivial interaction.
Significance. If the results hold, the work provides mechanistic insights into efficient learning of natural image distributions by deep networks, extending simplicity bias analyses beyond quasi-isotropic inputs. The combination of rigorous sample-complexity bounds for the synthetic model, power-law acceleration derivations, and empirical validation on real datasets strengthens the Fourier-based explanation of learning dynamics. The parameter-free nature of the hardness lower bound and the reproducible experimental setup on standard benchmarks are notable strengths.
major comments (2)
- [§3.2] §3.2 (Synthetic data model definition): The central hardness claim that phase-only classification is information-theoretically and algorithmically hard for online SGD (requiring n ≫ N³ log²N) depends on the model introducing no unintended label-correlated phase alignments or higher-order dependencies under the translation-invariance constraint. The phase sampling procedure could embed weak correlations that invalidate the lower bound as a general statement about isotropic inputs; an explicit proof or numerical verification that labels remain independent of phases in the Fourier domain is needed to confirm the result is not model-specific.
- [Theorem 4.1] Theorem 4.1 (Hardness lower bound for online SGD): The derivation assumes the synthetic model faithfully captures the interaction between amplitudes, phases, and dynamics without artifacts. If the translation-invariance enforcement introduces even mild phase-label dependencies, the claimed separation from noise (n ≪ N³ vs n ≫ N³ log²N) may not hold in the intended regime; a direct comparison to a fully random-phase baseline would clarify whether the bound is tight.
minor comments (2)
- [Figure 3] Figure 3 and associated text: Error bars or multiple random seeds are not reported for the ImageNet/CIFAR100 runs, making it harder to assess the statistical significance of the observed amplitude-to-phase transition.
- [§2] Notation: The definition of the Fourier transform and the precise normalization used for amplitudes/phases should be stated explicitly in §2 to avoid ambiguity when comparing to standard image processing conventions.
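The normalization point is easy to illustrate with NumPy, whose `norm` argument to `np.fft.fft` selects between three conventions. The sketch below compares the energy in the Fourier domain with the energy in the signal domain for each convention; the helper `energy_ratio` is a hypothetical name, and which convention the paper uses is not stated here.

```python
import numpy as np

def energy_ratio(x, norm):
    """Return sum|X_k|^2 / sum|x_j|^2 for the DFT computed with `norm`."""
    spectrum = np.fft.fft(x, norm=norm)
    return np.sum(np.abs(spectrum) ** 2) / np.sum(np.abs(x) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
for norm in ("backward", "ortho", "forward"):
    # The ratio is N for "backward", 1 for "ortho", and 1/N for "forward".
    print(norm, energy_ratio(x, norm))
```

The conventions differ only by a positive scale factor, so the phases are unaffected, but the amplitudes change by powers of √N. That matters whenever amplitudes are compared against quantities that themselves scale with N.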
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work and for the constructive comments on the synthetic data model and hardness results. We address the two major comments point by point below.
-
Referee: [§3.2] §3.2 (Synthetic data model definition): The central hardness claim that phase-only classification is information-theoretically and algorithmically hard for online SGD (requiring n ≫ N³ log²N) depends on the model introducing no unintended label-correlated phase alignments or higher-order dependencies under the translation-invariance constraint. The phase sampling procedure could embed weak correlations that invalidate the lower bound as a general statement about isotropic inputs; an explicit proof or numerical verification that labels remain independent of phases in the Fourier domain is needed to confirm the result is not model-specific.
Authors: We agree that an explicit check for label-phase independence is important to ensure the hardness result is not an artifact of the model construction. In Section 3.2, phases are drawn independently and uniformly, and the label is generated from a translation-invariant function of the full phase vector (specifically, a thresholded sum over selected frequency interactions). This construction is designed to make the label uncorrelated with any fixed subset of phases. To confirm, we have added numerical verification in the revision: the empirical correlation between the label and each individual phase coefficient is statistically indistinguishable from zero across multiple random seeds, and mutual information estimates are at the level of sampling noise. We will include this as a new panel in Figure 3 (or an appendix) to substantiate that no unintended dependencies are present. revision: yes
-
Referee: [Theorem 4.1] Theorem 4.1 (Hardness lower bound for online SGD): The derivation assumes the synthetic model faithfully captures the interaction between amplitudes, phases, and dynamics without artifacts. If the translation-invariance enforcement introduces even mild phase-label dependencies, the claimed separation from noise (n ≪ N³ vs n ≫ N³ log²N) may not hold in the intended regime; a direct comparison to a fully random-phase baseline would clarify whether the bound is tight.
Authors: We appreciate the suggestion for a random-phase baseline comparison. The proof of Theorem 4.1 shows that the expected gradient contribution from the phase variables vanishes under isotropy, with the N³ scaling arising from the variance of the stochastic updates. To verify that translation invariance does not introduce spurious dependencies that would invalidate the separation, we will add experiments in the revised version comparing our structured-phase model against a fully random-phase control (where labels are assigned independently of the input). We expect the random-phase control to stay at chance level, as for pure noise, while the structured case should exhibit the predicted delay; this would confirm that the bound reflects the intended phase-learning difficulty rather than model artifacts. revision: yes
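The label-phase independence check described in the first response can be sketched as follows. The label rule used here, a thresholded sum of cos(φ_a + φ_b − φ_{a+b}) over a few frequency pairs, is a hypothetical stand-in chosen because it is invariant under translations. It is not the construction from Section 3.2, and the function names are likewise assumptions.

```python
import numpy as np

def sample_phases(n_samples, n_freqs, rng):
    """I.i.d. uniform phases; column k holds the phase of frequency k."""
    return rng.uniform(0.0, 2.0 * np.pi, size=(n_samples, n_freqs))

def toy_label(phases, pairs):
    """Hypothetical label: sign of a sum of shift-invariant phase combinations."""
    total = np.zeros(phases.shape[0])
    for a, b in pairs:
        # A shift by s adds -2*pi*k*s/N to phase k, which cancels in this sum.
        total += np.cos(phases[:, a] + phases[:, b] - phases[:, a + b])
    return np.where(total > 0.0, 1.0, -1.0)

def max_abs_phase_label_correlation(phases, labels):
    """Largest |Pearson correlation| between the label and cos/sin of any phase."""
    feats = np.concatenate([np.cos(phases), np.sin(phases)], axis=1)
    feats = feats - feats.mean(axis=0)
    y = labels - labels.mean()
    corr = feats.T @ y / (np.linalg.norm(feats, axis=0) * np.linalg.norm(y))
    return np.max(np.abs(corr))

rng = np.random.default_rng(0)
pairs = [(1, 2), (1, 3), (2, 3), (2, 4)]   # arbitrary illustrative choice
phases = sample_phases(20000, 9, rng)
labels = toy_label(phases, pairs)
print(max_abs_phase_label_correlation(phases, labels), 1.0 / np.sqrt(len(labels)))
```

A maximum correlation on the order of 1/√n, with n the number of samples, is what sampling noise alone would produce. Note that vanishing correlations with individual phases do not rule out higher-order dependence: in this toy rule the label is, by design, a function of the phases jointly. Replacing `labels` with `rng.permutation(labels)` gives a simple control in which labels are independent of the input.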
Circularity Check
Standard high-dimensional SGD analysis supports hardness result without reduction to fitted inputs or self-citations
The paper introduces a synthetic translation-invariant data model to control amplitudes and phases, then rigorously derives the online SGD hardness bound (n ≪ N³ vs n ≫ N³ log²N) for phase-only classification using standard high-dimensional analysis techniques. This does not reduce by construction to quantities fitted from the target result, nor does it rely on load-bearing self-citations or ansatzes smuggled from prior work. Power-law acceleration is shown both theoretically and via independent experiments on textures/ImageNet. The derivation chain is self-contained and externally falsifiable via the stated assumptions on isotropic inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Inputs are high-dimensional, isotropic, and translation-invariant for the hardness result on phase learning.
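For readers who want the assumption spelled out, the following standard facts about zero-mean, translation-invariant inputs may help. They are textbook properties of circulant matrices rather than statements taken from the paper, and the exponent α in the last line is a hypothetical symbol for the power-law decay.

```latex
% Translation invariance: the covariance depends only on the circular lag.
\[
  C_{jk} = \mathbb{E}[x_j x_k] = c\big((j-k) \bmod N\big),
  \qquad j,k = 0,\dots,N-1 .
\]
% A circulant matrix is diagonalised by the discrete Fourier transform,
% with eigenvalues given by the DFT of its first column.
\[
  \lambda_k = \sum_{r=0}^{N-1} c(r)\, e^{-2\pi i k r / N},
  \qquad k = 0,\dots,N-1 .
\]
% Isotropic inputs correspond to a flat spectrum.
\[
  C = I_N \iff \lambda_k = 1 \ \text{for all } k .
\]
% A power-law spectrum would instead take the (assumed) form
\[
  \lambda_k \propto |k|^{-\alpha}, \qquad \alpha > 0,\ k \neq 0 .
\]
```

In this picture the hardness result concerns the flat-spectrum case, while the acceleration result concerns spectra that decay with frequency.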
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel. Relation between the paper passage and the cited Recognition theorem: unclear. Passage: "We prove that when the inputs have isotropic covariance, weakly recovering information carried exclusively by the phases requires a sample complexity on the order of n ≫ N³ for online SGD... information exponent k* = 4"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration. Relation between the paper passage and the cited Recognition theorem: unclear. Passage: "power-law spectra can dramatically accelerate the speed of learning phase information... λ_k0 ≈ √N ... effective signal-to-noise ratio λ²_k0 ≈ N"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.