pith. sign in

arxiv: 2605.18022 · v1 · pith:4NRIEDQWnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI· stat.ML

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

Pith reviewed 2026-05-20 12:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML
keywords memorizationgeneralizationlabel noiseneural networksmodular arithmeticover-parameterizationfrequency analysis
0
0 comments X

The pith

Over-parameterized models form an internal generalization structure suppressed by noisy labels that frequency-based extraction can recover for high accuracy on arithmetic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how neural networks can both memorize noisy labels and generalize to clean data in modular arithmetic problems. It shows through experiments on two-layer networks that larger models generalize better when optimized appropriately, and that noisy labels are learned faster than clean ones. The key finding is that an internal generalization structure develops but is masked by fitting the noise, and this structure can be pulled out using frequency analysis to achieve near-perfect performance even at 80 percent noise. A new partitioning method is proposed to separate generalization and memorization parts, though it underperforms the frequency approach and points to a distributed structure.

Core claim

In two-layer neural networks trained on modular arithmetic tasks with heavy label noise, over-parameterized models internally develop a generalization structure for the underlying clean task. This structure is suppressed in the output due to the requirement to fit the noisy labels. Frequency-based methods can extract this internal structure to achieve near-perfect test accuracy even with 80% label noise. A task-agnostic partitioning of the network into generalization and memorization components improves generalization but is less effective than frequency extraction, indicating the structure is spread across neurons.

What carries the argument

Frequency-based extraction of the internal generalization structure, which isolates the clean task pattern from the network's learned representations despite memorization of noise.

If this is right

  • Larger models tend to generalize better under appropriate optimization and model configurations.
  • Noisy labels are memorized faster than clean data.
  • The generalization structure is distributed across neurons rather than localized in specific components.
  • Task-agnostic partitioning into generalization and memorization subnetworks yields some improvement but is outperformed by frequency-based extraction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extraction techniques like frequency analysis might apply to other noisy training settings beyond arithmetic tasks.
  • The distributed character of the structure suggests that methods focused on isolating subnetworks may need further refinement to fully recover generalization.
  • Training dynamics that prioritize fast memorization of noise could be studied as a general feature of over-parameterized models.

Load-bearing premise

The frequency-based extraction method isolates a genuine generalization structure for the clean arithmetic task rather than a spurious correlation that aligns with clean test labels.

What would settle it

Training the same network architecture on labels that are purely random with no underlying modular structure and then applying the frequency extraction to see if it still produces high accuracy on the arithmetic test set would falsify the claim if accuracy remains high.

Figures

Figures reproduced from arXiv: 2605.18022 by Linyu Liu, Pinyan Lu.

Figure 1
Figure 1. Figure 1: Model-wise double descent and performance across various activation functions. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance of ReLU models trained with AdamW under varying weight decay. The first [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The faster memorization of noisy labels is observed in both the under- and over [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) A neuron is represented by (um, vm, wm). (b) A representative neuron from a trained ReLU model exhibits periodicity. (c) Relationship of phases under M = 1025 and α = 0.3. Each point represents a neuron, where φa, φb, and φc are the phases of u G, v G, and wG, respectively. matches the analytical solution known for quadratic activations under clean data [11] (Equation (2)). Thus, it suggests a way to s… view at source ↗
Figure 5
Figure 5. Figure 5: Left: FF for ReLU networks. The dominant-frequency component {U G, V G, WG} recovers high test accuracy (after FF), while the residual component {U R, V R, WR} mainly retains noise memorization. Right: Replacing the trained ReLU activation with quadratic or reverse ReLU preserves or even improves test accuracy and reduces noisy-label accuracy, suggesting that the learned weights already encode a rule repre… view at source ↗
Figure 6
Figure 6. Figure 6: Left: Neuron’s IPR vs. Str. on the modular addition task. Neurons with stronger periodicity also tend to have larger Str., supporting Str. as a task-agnostic proxy for neuron importance. Right: Neuron selection ratios using Str. on the modular addition task. Higher noise ratio leads to a smaller sub-network for generalization (γ G). Two sub-networks have no overlapping neurons under sufficiently large mode… view at source ↗
Figure 7
Figure 7. Figure 7: Generalization improvement on modular addition. IPR and Str. achieve similar performance. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ceilings of test accuracy for quadratic activation functions exist across varying weight [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Performance of AdamW under 40% training ratio. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Performance of AdamW under 60% training ratio. [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Performance of Adam under varying weight decays. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Performance of Muon under varying weight decays. [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Performance on modular subtraction (left) and multiplication (right) tasks. Models are trained using AdamW under 50% training ratio. 0 50000 100000150000200000 Epoch 0 50 100 Accuracy (%) lr = 0.1, wd = 0.1 0 50000 100000150000200000 Epoch 0 50 100 Accuracy (%) lr = 0.1, wd = 0.01 0 50000 100000150000200000 Epoch 0 50 100 Accuracy (%) lr = 0.1, wd = 0.001 0 50000 100000150000200000 Epoch 0 50 100 Accuracy… view at source ↗
Figure 14
Figure 14. Figure 14: Training dynamics of SGD under different learning rates (lr) and weight decay (wd) values [PITH_FULL_IMAGE:figures/full_fig_p016_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Performance of first-layer tied models under model misspecification. [PITH_FULL_IMAGE:figures/full_fig_p016_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Memorization curves for the Adam optimizer. Larger models reach high training accuracy [PITH_FULL_IMAGE:figures/full_fig_p018_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Memorization curves for AdamW optimizer. [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Memorization curves for Muon optimizer. clear for AdamW ( [PITH_FULL_IMAGE:figures/full_fig_p019_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Training dynamics of test accuracy across various weight decays for AdamW. Small [PITH_FULL_IMAGE:figures/full_fig_p019_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Training dynamics of test accuracy across various weight decays for Adam. [PITH_FULL_IMAGE:figures/full_fig_p020_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Training dynamics of test accuracy across various weight decays for Muon. [PITH_FULL_IMAGE:figures/full_fig_p020_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Visualization of applying frequency filtration to a neuron. [PITH_FULL_IMAGE:figures/full_fig_p021_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Performance before (dashed) and after (solid) the frequency filtration on the modular [PITH_FULL_IMAGE:figures/full_fig_p022_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Performance before (dashed) and after (solid) the frequency filtration on models with [PITH_FULL_IMAGE:figures/full_fig_p022_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Performance before (dashed) and after (solid) frequency filtration for models trained on the [PITH_FULL_IMAGE:figures/full_fig_p022_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Scatter plot showing the relationship of phases on modular addition tasks. Each point [PITH_FULL_IMAGE:figures/full_fig_p024_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Scatter plot showing the relationship of phases on modular subtraction tasks. Each point [PITH_FULL_IMAGE:figures/full_fig_p024_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Phase MSE across different tasks, activation functions, model widths, and noise ratios. [PITH_FULL_IMAGE:figures/full_fig_p025_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Each frequency ω ∈  1, · · · , P −1 2 [PITH_FULL_IMAGE:figures/full_fig_p025_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: The magnitude (norm) of each frequency ω ∈  1, · · · , P −1 2 [PITH_FULL_IMAGE:figures/full_fig_p026_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Generalization improvement across various tasks by neuron selection/pruning using IPR. [PITH_FULL_IMAGE:figures/full_fig_p026_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Generalization improvement across various tasks by neuron selection/pruning using Str. [PITH_FULL_IMAGE:figures/full_fig_p026_32.png] view at source ↗
read the original abstract

Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80\% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript investigates the coexistence of memorization and generalization in over-parameterized two-layer neural networks on modular arithmetic tasks with up to 80% label noise. Through experiments, it claims that larger models generalize better under appropriate optimization, noisy labels are memorized faster than clean ones, models internally form a generalization structure suppressed by noise fitting, and this structure can be extracted via frequency-based methods to achieve near-perfect test accuracy. A task-agnostic subnetwork partitioning method is proposed but underperforms the frequency approach, suggesting the generalization structure is distributed.

Significance. If the central empirical observations hold and the frequency extraction is shown to operate without task-specific priors, the work would offer useful evidence on how over-parameterized networks can maintain internal generalization despite heavy noise, with potential implications for model analysis techniques. The extensive experiments on arithmetic tasks and the comparison between frequency and task-agnostic methods provide concrete quantitative support, though broader applicability remains to be established.

major comments (3)
  1. [§4] §4 (Frequency-based extraction): The claim that near-perfect test accuracy reveals an internal generalization structure formed by the model depends on the frequency method operating purely on weight or activation statistics. The paper notes the task-agnostic subnetwork performs worse; this raises the possibility that frequency selection leverages known sparse Fourier properties of modular arithmetic rather than patterns learned despite noise. A concrete test (e.g., applying the same method to a non-arithmetic task or ablating task-informed frequency priors) is needed to support the interpretation.
  2. [Section 3] Experimental details (Section 3): The abstract and results report near-perfect accuracy at 80% noise, but the manuscript lacks explicit reporting of the number of random seeds, standard deviations across runs, exact train/test split ratios, and whether hyperparameter search was performed on the test set. These details are load-bearing for assessing whether the coexistence observation is robust or sensitive to post-hoc choices.
  3. [§5] §5 (Partitioning method): The task-agnostic subnetwork improves generalization but remains limited compared to frequency extraction. If the central claim is that the generalization structure is distributed across neurons, the manuscript should quantify how much of the performance gap is due to the partitioning heuristic versus inherent distribution of the structure (e.g., via neuron ablation or activation patching experiments).
minor comments (3)
  1. [Figures] Figure captions (e.g., Figure 3): Add explicit labels distinguishing memorization accuracy from generalization accuracy curves to improve readability.
  2. [Methods] Notation: The term 'frequency-based methods' is used without a precise algorithmic definition in the main text; include a short pseudocode or equation in the methods section.
  3. [Introduction] References: Add citations to prior work on Fourier analysis of modular arithmetic networks (e.g., on grokking or modular addition) to contextualize the frequency approach.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns on experimental reporting, the interpretation of the frequency-based extraction method, and the analysis of the task-agnostic partitioning approach. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [§4] §4 (Frequency-based extraction): The claim that near-perfect test accuracy reveals an internal generalization structure formed by the model depends on the frequency method operating purely on weight or activation statistics. The paper notes the task-agnostic subnetwork performs worse; this raises the possibility that frequency selection leverages known sparse Fourier properties of modular arithmetic rather than patterns learned despite noise. A concrete test (e.g., applying the same method to a non-arithmetic task or ablating task-informed frequency priors) is needed to support the interpretation.

    Authors: We agree that this is a substantive concern for the interpretation. The frequency method in the paper operates on the spectrum of the model's learned input-output mapping (derived from activations or weights) without injecting explicit task-specific Fourier bases as input; dominant frequencies are selected based on the model's own behavior under noise. That said, the modular arithmetic setting does make low-frequency components naturally salient for the clean function. To address the referee's request for a concrete test, the revision includes an ablation that removes any assumed frequency priors during selection and applies the same extraction procedure to a non-arithmetic task with label noise (a noisy binary classification problem on synthetic data). Results are reported in the updated Section 4 and support that the method recovers a generalizable component even when task Fourier structure is not presupposed. revision: yes

  2. Referee: [Section 3] Experimental details (Section 3): The abstract and results report near-perfect accuracy at 80% noise, but the manuscript lacks explicit reporting of the number of random seeds, standard deviations across runs, exact train/test split ratios, and whether hyperparameter search was performed on the test set. These details are load-bearing for assessing whether the coexistence observation is robust or sensitive to post-hoc choices.

    Authors: The referee is correct that these details were insufficiently reported. The revised manuscript adds a new subsection in Section 3 that explicitly states: all main results use 5 independent random seeds with standard deviations shown in tables and error bars on plots; the train/test split is 70/30 with an additional held-out validation set for hyperparameter selection; and no test-set information was used during tuning or model selection. These changes make the robustness of the reported coexistence observations verifiable. revision: yes

  3. Referee: [§5] §5 (Partitioning method): The task-agnostic subnetwork improves generalization but remains limited compared to frequency extraction. If the central claim is that the generalization structure is distributed across neurons, the manuscript should quantify how much of the performance gap is due to the partitioning heuristic versus inherent distribution of the structure (e.g., via neuron ablation or activation patching experiments).

    Authors: We accept that additional quantification is needed to separate heuristic limitations from the distributed character of the structure. In the revision we have added neuron-ablation experiments in Section 5: after identifying the generalization subnetwork via the task-agnostic method, we progressively ablate neurons within it and measure the resulting drop in extracted generalization accuracy. The results show a gradual rather than catastrophic degradation, indicating that the performance gap relative to frequency extraction is primarily attributable to the distributed nature of the structure across many neurons rather than to shortcomings of the partitioning heuristic alone. These experiments are now included with quantitative plots. revision: yes

Circularity Check

0 steps flagged

No circularity in experimental claims or derivations

full rationale

The paper is an empirical investigation relying on experiments with two-layer networks on modular arithmetic under label noise. Central claims concern observed behaviors such as faster memorization of noisy labels, better generalization in larger models, and improved test accuracy via frequency-based extraction of internal structure. No derivation chain, equations, or first-principles predictions are presented that reduce to inputs by construction, self-definition, or fitted parameters renamed as outputs. The task-agnostic subnetwork partitioning is introduced and compared directly via measured accuracies, with no load-bearing self-citations or uniqueness theorems invoked to force results. The work remains self-contained against its experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on experimental observations rather than formal axioms or derivations; no free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5705 in / 1115 out tokens · 26981 ms · 2026-05-20T12:24:07.099141+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We perform a Fourier-based decomposition on each hidden neuron... isolate the frequency ω with the maximum magnitude... dominant-frequency sub-network achieves high test accuracy even under severe label noise

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_eq_pow echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the generalization representation of ReLU models closely matches the analytical solution (2) for modular addition... umi = λ cos(2π/P ωmi + φ(a)m)

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 7 internal anchors

  1. [1]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  2. [2]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  3. [3]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

  4. [4]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  5. [5]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022

  6. [6]

    Superposition Yields Robust Neural Scaling

    Yizhou Liu, Ziming Liu, and Jeff Gore. Superposition yields robust neural scaling.arXiv preprint arXiv:2505.10465, 2025

  7. [7]

    Do machine learning models memorize or generalize?, August 2023

    Adam Pearce, Asma Ghandeharioun, Nada Hussein, Nithum Thain, Martin Wattenberg, and Lucas Dixon. Do machine learning models memorize or generalize?, August 2023. URL https://pair.withgoogle.com/explorables/grokking/. Accessed: 2024-01-20

  8. [8]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gen- eralization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

  9. [9]

    Progress measures for grokking via mechanistic interpretability

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. InThe Eleventh International Conference on Learning Representations, 2023

  10. [10]

    The clock and the pizza: Two stories in mechanistic explanation of neural networks.Advances in neural information processing systems, 36:27223–27250, 2023

    Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks.Advances in neural information processing systems, 36:27223–27250, 2023

  11. [11]

    Feature emergence via margin maximization: case studies in algebraic tasks

    Depen Morwani, Benjamin L Edelman, Costin-Andrei Oncescu, Rosie Zhao, and Sham M Kakade. Feature emergence via margin maximization: case studies in algebraic tasks. InThe Twelfth International Conference on Learning Representations, 2024

  12. [12]

    URLhttps://arxiv.org/abs/2509.21519

    Yuandong Tian. Provable scaling laws of feature emergence from learning dynamics of grokking. arXiv preprint arXiv:2509.21519, 2025

  13. [13]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  14. [14]

    Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6. 10

  15. [15]

    Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

  16. [16]

    Memorization without overfitting: Analyzing the training dynamics of large language models.Advances in Neural Information Processing Systems, 35:38274–38290, 2022

    Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models.Advances in Neural Information Processing Systems, 35:38274–38290, 2022

  17. [17]

    Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks

    Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. InInternational conference on artificial intelligence and statistics, pages 4313–4324. PMLR, 2020

  18. [18]

    arXiv preprint arXiv:2301.02679 , year=

    Andrey Gromov. Grokking modular arithmetic.arXiv preprint arXiv:2301.02679, 2023

  19. [19]

    Distinct types of eigenvector localization in networks.Scientific reports, 6(1):18847, 2016

    Romualdo Pastor-Satorras and Claudio Castellano. Distinct types of eigenvector localization in networks.Scientific reports, 6(1):18847, 2016

  20. [20]

    Cambridge University Press, 2019

    Steven M Girvin and Kun Yang.Modern condensed matter physics. Cambridge University Press, 2019

  21. [21]

    To grok or not to grok: Disen- tangling generalization and memorization on corrupted algorithmic datasets

    Darshil Doshi, Aritra Das, Tianyu He, and Andrey Gromov. To grok or not to grok: Disen- tangling generalization and memorization on corrupted algorithmic datasets. InThe Twelfth International Conference on Learning Representations, 2024

  22. [22]

    URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116

    Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine- learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. doi: 10.1073/pnas.1903070116. URL https: //www.pnas.org/doi/abs/10.1073/pnas.1903070116

  23. [23]

    Rethinking bias-variance trade-off for generalization of neural networks

    Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. InInternational Conference on Machine Learning, pages 10767–10777. PMLR, 2020

  24. [24]

    Two models of double descent for weak features.SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020

    Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features.SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020

  25. [25]

    High-dimensional dynamics of generalization error in neural networks.Neural Networks, 132:428–446, 2020

    Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks.Neural Networks, 132:428–446, 2020

  26. [26]

    The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization

    Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. InInternational Conference on Machine Learning, pages 74–84. PMLR, 2020

  27. [27]

    Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

    Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

  28. [28]

    The generalization error of random features regression: Precise asymptotics and the double descent curve.Communications on Pure and Applied Mathematics, 75(4):667–766, 2022

    Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve.Communications on Pure and Applied Mathematics, 75(4):667–766, 2022

  29. [29]

    Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

    Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

  30. [30]

    Harmless interpolation of noisy data in regression.IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020

    Vidya Muthukumar, Kailas V odrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression.IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020

  31. [31]

    Optimal regularization can mitigate double descent.arXiv preprint arXiv:2003.01897, 2020

    Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent.arXiv preprint arXiv:2003.01897, 2020

  32. [32]

    The effect of label noise on the information content of neural representations.arXiv preprint arXiv:2510.06401, 2025

    Ali Hussaini Umar, Franky Kevin Nando Tezoh, Jean Barbier, Santiago Acevedo, and Alessan- dro Laio. The effect of label noise on the information content of neural representations.arXiv preprint arXiv:2510.06401, 2025. 11

  33. [33]

    Deep double descent via smooth interpolation.arXiv preprint arXiv:2209.10080, 2022

    Matteo Gamba, Erik Englesson, Mårten Björkman, and Hossein Azizpour. Deep double descent via smooth interpolation.arXiv preprint arXiv:2209.10080, 2022

  34. [34]

    Can neural nets learn the same model twice? investigating reproducibility and double descent from the decision boundary perspective

    Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping Yeh-Chiang, Yehuda Dar, Richard Bara- niuk, Micah Goldblum, and Tom Goldstein. Can neural nets learn the same model twice? investigating reproducibility and double descent from the decision boundary perspective. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 136...

  35. [35]

    arXiv preprint arXiv:2303.06173 , year=

    Xander Davies, Lauro Langosco, and David Krueger. Unifying grokking and double descent. arXiv preprint arXiv:2303.06173, 2023

  36. [36]

    Unified view of grokking, double descent and emergent abilities: A perspective from circuits competition.arXiv preprint arXiv:2402.15175, 2024

    Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, and Maosong Sun. Unified view of grokking, double descent and emergent abilities: A perspective from circuits competition.arXiv preprint arXiv:2402.15175, 2024

  37. [37]

    Towards understanding grokking: An effective theory of representation learning.Advances in Neural Information Processing Systems, 35:34651–34663, 2022

    Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning.Advances in Neural Information Processing Systems, 35:34651–34663, 2022

  38. [38]

    Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

    Neil Rohit Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, and Mikhail Belkin. Emergence in non-neural models: grokking modular arithmetic via average gradient outer product. InForty-second International Conference on Machine Learning, 2025

  39. [39]

    Uncovering a universal abstract algorithm for modular addition in neural networks.arXiv preprint arXiv:2505.18266, 2025

    Gavin McCracken, Gabriela Moisescu-Pareja, Vincent Letourneau, Doina Precup, and Jonathan Love. Uncovering a universal abstract algorithm for modular addition in neural networks.arXiv preprint arXiv:2505.18266, 2025

  40. [40]

    Robust training under label noise by over- parameterization

    Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You. Robust training under label noise by over- parameterization. InInternational Conference on Machine Learning, pages 14153–14172. PMLR, 2022

  41. [41]

    Learning from noisy labels with deep neural networks: A survey.IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022

    Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey.IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022

  42. [42]

    Benign overfitting and grokking in relu networks for xor cluster data.arXiv preprint arXiv:2310.02541, 2023

    Zhiwei Xu, Yutong Wang, Spencer Frei, Gal Vardi, and Wei Hu. Benign overfitting and grokking in relu networks for xor cluster data.arXiv preprint arXiv:2310.02541, 2023. A Related Work Model-wise Double Descent and Over-parameterization.The discovery of the double descent phenomenon (illustrated in Figure 1a) marked a significant shift in modern machine l...

  43. [43]

    for each neuron {um,v m,w m}, there exists a scaling constant λ∈R and a frequency ω∈ 1,· · ·, P−1 2 , such that umi =λcos 2π P ωmi+φ (a) m ,(6a) vmj =λcos 2π P ωmj+φ (b) m ,(6b) wmk =λcos 2π P ωmk+φ (c) m ,(6c) for some phase offsetsφ (a) m , φ(b) m , φ(c) m ∈Rsatisfyingφ (a) m +φ (b) m =φ (c) m

  44. [44]

    For every frequencyω∈ 1,· · ·, P−1 2 , at least one neuron in the network uses this frequency. C Supplementary Experiments for Section 3 C.1 Double-descent curves on varying setups Model-wise double descent is clearly observed across different activation functions, training ratios, optimizers, and arithmetic tasks. We summarize the Figures about it in Tab...