Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

Linyu Liu; Pinyan Lu

arxiv: 2605.18022 · v1 · pith:4NRIEDQWnew · submitted 2026-05-18 · cs.LG · cs.AI · stat.ML


Pith reviewed 2026-05-20 12:24 UTC · model grok-4.3

classification cs.LG · cs.AI · stat.ML

keywords memorization · generalization · label noise · neural networks · modular arithmetic · over-parameterization · frequency analysis


The pith

Over-parameterized models form an internal generalization structure suppressed by noisy labels that frequency-based extraction can recover for high accuracy on arithmetic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how neural networks can both memorize noisy labels and generalize to clean data in modular arithmetic problems. It shows through experiments on two-layer networks that larger models generalize better when optimized appropriately, and that noisy labels are learned faster than clean ones. The key finding is that an internal generalization structure develops but is masked by fitting the noise, and this structure can be pulled out using frequency analysis to achieve near-perfect performance even at 80 percent noise. A new partitioning method is proposed to separate generalization and memorization parts, though it underperforms the frequency approach and points to a distributed structure.

Core claim

In two-layer neural networks trained on modular arithmetic tasks with heavy label noise, over-parameterized models internally develop a generalization structure for the underlying clean task. This structure is suppressed in the output due to the requirement to fit the noisy labels. Frequency-based methods can extract this internal structure to achieve near-perfect test accuracy even with 80% label noise. A task-agnostic partitioning of the network into generalization and memorization components improves generalization but is less effective than frequency extraction, indicating the structure is spread across neurons.

What carries the argument

Frequency-based extraction of the internal generalization structure, which isolates the clean task pattern from the network's learned representations despite memorization of noise.
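The extraction step can be pictured with a small NumPy sketch. It is not the authors' code: it assumes each hidden neuron's weights over the P input tokens form a length-P vector, takes a real FFT of that vector, keeps only the non-constant frequency with the largest magnitude, and treats everything else as residual. The function name `dominant_frequency_filter` and the choice to ignore the constant (DC) term are illustrative assumptions.

```python
import numpy as np

def dominant_frequency_filter(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split each row of `weights` (neurons x P) into a dominant-frequency
    part and a residual part. Hypothetical helper, not the paper's code."""
    n_neurons, period = weights.shape
    spectrum = np.fft.rfft(weights, axis=1)       # shape (neurons, P // 2 + 1)
    magnitude = np.abs(spectrum)
    magnitude[:, 0] = 0.0                         # assumption: never pick the DC term
    dominant = magnitude.argmax(axis=1)           # one frequency index per neuron
    mask = np.zeros(spectrum.shape, dtype=bool)
    mask[np.arange(n_neurons), dominant] = True
    # Invert the spectrum with every bin except the dominant one zeroed out.
    kept = np.fft.irfft(np.where(mask, spectrum, 0), n=period, axis=1)
    residual = weights - kept
    return kept, residual
```

In this sketch the kept and residual parts always sum back to the original weights, so the filter only re-partitions what the network already learned; whether the kept part generalizes is the empirical question the paper addresses.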

If this is right

Larger models tend to generalize better under appropriate optimization and model configurations.
Noisy labels are memorized faster than clean data.
The generalization structure is distributed across neurons rather than localized in specific components.
Task-agnostic partitioning into generalization and memorization subnetworks yields some improvement but is outperformed by frequency-based extraction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extraction techniques like frequency analysis might apply to other noisy training settings beyond arithmetic tasks.
The distributed character of the structure suggests that methods focused on isolating subnetworks may need further refinement to fully recover generalization.
Training dynamics that prioritize fast memorization of noise could be studied as a general feature of over-parameterized models.

Load-bearing premise

The frequency-based extraction method isolates a genuine generalization structure for the clean arithmetic task rather than a spurious correlation that aligns with clean test labels.

What would settle it

Train the same network architecture on purely random labels with no underlying modular structure, then apply the frequency extraction and evaluate on the arithmetic test set. If accuracy remains high, the extraction is supplying the structure rather than recovering it from the network, and the claim is falsified.

Figures

Figures reproduced from arXiv: 2605.18022 by Linyu Liu, Pinyan Lu.

**Figure 2.** Performance of ReLU models trained with AdamW under varying weight decay. The first …

**Figure 3.** The faster memorization of noisy labels is observed in both the under- and over …

**Figure 4.** (a) A neuron is represented by (u_m, v_m, w_m). (b) A representative neuron from a trained ReLU model exhibits periodicity. (c) Relationship of phases under M = 1025 and α = 0.3. Each point represents a neuron, where φ_a, φ_b, and φ_c are the phases of u^G, v^G, and w^G, respectively. [Adjacent body text: … matches the analytical solution known for quadratic activations under clean data [11] (Equation (2)). Thus, it suggests a way to s…]

**Figure 5.** Left: FF for ReLU networks. The dominant-frequency component {U^G, V^G, W^G} recovers high test accuracy (after FF), while the residual component {U^R, V^R, W^R} mainly retains noise memorization. Right: Replacing the trained ReLU activation with quadratic or reverse ReLU preserves or even improves test accuracy and reduces noisy-label accuracy, suggesting that the learned weights already encode a rule repre…

**Figure 6.** Left: Neuron’s IPR vs. Str. on the modular addition task. Neurons with stronger periodicity also tend to have larger Str., supporting Str. as a task-agnostic proxy for neuron importance. Right: Neuron selection ratios using Str. on the modular addition task. Higher noise ratio leads to a smaller sub-network for generalization (γ^G). Two sub-networks have no overlapping neurons under sufficiently large mode…

**Figure 7.** Generalization improvement on modular addition. IPR and Str. achieve similar performance.

**Figure 8.** Ceilings of test accuracy for quadratic activation functions exist across varying weight …

**Figure 9.** Performance of AdamW under 40% training ratio.

**Figure 10.** Performance of AdamW under 60% training ratio.

**Figure 11.** Performance of Adam under varying weight decays.

**Figure 12.** Performance of Muon under varying weight decays.

**Figure 13.** Performance on modular subtraction (left) and multiplication (right) tasks. Models are trained using AdamW under 50% training ratio. [Plot residue: panels titled lr = 0.1 with wd = 0.1, 0.01, 0.001, …; x-axis Epoch (0 to 200000), y-axis Accuracy (%), 0 to 100.]

**Figure 14.** Training dynamics of SGD under different learning rates (lr) and weight decay (wd) values …

**Figure 15.** Performance of first-layer tied models under model misspecification.

**Figure 16.** Memorization curves for the Adam optimizer. Larger models reach high training accuracy …

**Figure 17.** Memorization curves for AdamW optimizer.

**Figure 18.** Memorization curves for Muon optimizer.

**Figure 19.** Training dynamics of test accuracy across various weight decays for AdamW. Small …

**Figure 20.** Training dynamics of test accuracy across various weight decays for Adam.

**Figure 21.** Training dynamics of test accuracy across various weight decays for Muon.

**Figure 22.** Visualization of applying frequency filtration to a neuron.

**Figure 23.** Performance before (dashed) and after (solid) the frequency filtration on the modular …

**Figure 24.** Performance before (dashed) and after (solid) the frequency filtration on models with …

**Figure 25.** Performance before (dashed) and after (solid) frequency filtration for models trained on the …

**Figure 26.** Scatter plot showing the relationship of phases on modular addition tasks. Each point …

**Figure 27.** Scatter plot showing the relationship of phases on modular subtraction tasks. Each point …

**Figure 28.** Phase MSE across different tasks, activation functions, model widths, and noise ratios.

**Figure 29.** Each frequency ω ∈ {1, …, (P−1)/2} …

**Figure 30.** The magnitude (norm) of each frequency ω ∈ {1, …, (P−1)/2} …

**Figure 31.** Generalization improvement across various tasks by neuron selection/pruning using IPR.

**Figure 32.** Generalization improvement across various tasks by neuron selection/pruning using Str.
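Several captions above rank neurons by IPR. The page does not give the paper's exact definition, so the following is only a sketch of one common form, the inverse participation ratio of a neuron's Fourier power spectrum: Σ p_k² / (Σ p_k)², where p_k is the power at frequency k. The function name `fourier_ipr` and the exclusion of the constant term are assumptions. Under this definition a perfectly periodic neuron scores close to 1 and a neuron with a flat spectrum over K frequencies scores 1/K.

```python
import numpy as np

def fourier_ipr(weights: np.ndarray) -> np.ndarray:
    """Inverse participation ratio of each row's Fourier power spectrum.
    Hypothetical helper; the paper's definition may differ."""
    # Power at each non-constant frequency (DC bin dropped by assumption).
    power = np.abs(np.fft.rfft(weights, axis=1)[:, 1:]) ** 2
    total = power.sum(axis=1)
    # Guard against division by zero for all-zero rows.
    return (power ** 2).sum(axis=1) / np.maximum(total ** 2, 1e-12)
```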

Original abstract

Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
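For concreteness, here is one way a noisy modular-addition training set of the kind the abstract describes could be built. This is a sketch under stated assumptions, not the paper's data pipeline: the function name `make_noisy_modular_addition`, the uniform choice of a wrong label, and the convention that the noise ratio applies to the training split only are all illustrative.

```python
import numpy as np

def make_noisy_modular_addition(P: int, noise_ratio: float, train_ratio: float, seed: int = 0):
    """Build all (a, b) pairs with labels (a + b) mod P, split them, and
    corrupt a fraction of the training labels. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    a, b = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
    pairs = np.stack([a.ravel(), b.ravel()], axis=1)       # shape (P*P, 2)
    clean = (pairs[:, 0] + pairs[:, 1]) % P

    perm = rng.permutation(P * P)
    n_train = int(train_ratio * P * P)
    train_idx, test_idx = perm[:n_train], perm[n_train:]

    noisy = clean.copy()
    n_noisy = int(noise_ratio * n_train)
    corrupt = train_idx[:n_noisy]
    # Adding a nonzero offset mod P guarantees the corrupted label is wrong.
    noisy[corrupt] = (clean[corrupt] + rng.integers(1, P, size=n_noisy)) % P
    return pairs, clean, noisy, train_idx, test_idx, corrupt
```

Training would use `noisy[train_idx]`, test accuracy would be measured against `clean[test_idx]`, and memorization could be tracked separately on `corrupt`.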

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper uses modular arithmetic with heavy label noise to show that overparameterized nets form an internal clean structure recoverable by frequency extraction, but the extraction likely benefits from task-specific Fourier priors.


The main thing to know is that this work finds overparameterized two-layer networks on modular arithmetic can still encode the clean target function internally even at 80% label noise, and frequency-based extraction pulls it out for near-perfect test accuracy while a task-agnostic partition does noticeably worse. They also report that larger models generalize better under suitable training and that noisy labels are fit faster than clean ones. The distributed character of the structure is presented as the reason a simple subnetwork extraction falls short.

These observations come from controlled experiments that track memorization and generalization separately on a synthetic task. That setup is the paper's real strength: it produces clear, quantitative patterns without the usual ambiguities of natural data. The partitioning attempt is a reasonable first cut at separating the two behaviors inside the model. The central empirical claims are stated plainly and tied to specific noise levels and accuracy numbers.

The soft spot is the frequency extraction step. Because it outperforms the task-agnostic method, it is reasonable to ask whether the procedure selects or weights frequencies using knowledge of the modular function's sparse spectrum rather than reading the structure directly from weights or activations. If any task-specific tuning is involved, the high recovered accuracy is less diagnostic of an emergent internal generalization and more consistent with projection onto the known clean subspace. The abstract itself notes the performance gap, which makes the concern concrete rather than speculative.

This paper is mainly for researchers who study double descent, label noise robustness, or mechanistic accounts of generalization in overparameterized models. Anyone working with synthetic tasks or internal representation analysis will get usable observations from it.

The experimental design is clean enough and the questions are well-posed enough that the work deserves a serious referee, even if the frequency method needs tighter validation. I would send it to review with a request to document exactly how frequencies are chosen and whether the procedure remains effective when task knowledge is withheld.

Referee Report

3 major / 3 minor

Summary. The manuscript investigates the coexistence of memorization and generalization in over-parameterized two-layer neural networks on modular arithmetic tasks with up to 80% label noise. Through experiments, it claims that larger models generalize better under appropriate optimization, noisy labels are memorized faster than clean ones, models internally form a generalization structure suppressed by noise fitting, and this structure can be extracted via frequency-based methods to achieve near-perfect test accuracy. A task-agnostic subnetwork partitioning method is proposed but underperforms the frequency approach, suggesting the generalization structure is distributed.

Significance. If the central empirical observations hold and the frequency extraction is shown to operate without task-specific priors, the work would offer useful evidence on how over-parameterized networks can maintain internal generalization despite heavy noise, with potential implications for model analysis techniques. The extensive experiments on arithmetic tasks and the comparison between frequency and task-agnostic methods provide concrete quantitative support, though broader applicability remains to be established.

major comments (3)

[§4] §4 (Frequency-based extraction): The claim that near-perfect test accuracy reveals an internal generalization structure formed by the model depends on the frequency method operating purely on weight or activation statistics. The paper notes the task-agnostic subnetwork performs worse; this raises the possibility that frequency selection leverages known sparse Fourier properties of modular arithmetic rather than patterns learned despite noise. A concrete test (e.g., applying the same method to a non-arithmetic task or ablating task-informed frequency priors) is needed to support the interpretation.
[Section 3] Experimental details (Section 3): The abstract and results report near-perfect accuracy at 80% noise, but the manuscript lacks explicit reporting of the number of random seeds, standard deviations across runs, exact train/test split ratios, and whether hyperparameter search was performed on the test set. These details are load-bearing for assessing whether the coexistence observation is robust or sensitive to post-hoc choices.
[§5] §5 (Partitioning method): The task-agnostic subnetwork improves generalization but remains limited compared to frequency extraction. If the central claim is that the generalization structure is distributed across neurons, the manuscript should quantify how much of the performance gap is due to the partitioning heuristic versus inherent distribution of the structure (e.g., via neuron ablation or activation patching experiments).

minor comments (3)

[Figures] Figure captions (e.g., Figure 3): Add explicit labels distinguishing memorization accuracy from generalization accuracy curves to improve readability.
[Methods] Notation: The term 'frequency-based methods' is used without a precise algorithmic definition in the main text; include a short pseudocode or equation in the methods section.
[Introduction] References: Add citations to prior work on Fourier analysis of modular arithmetic networks (e.g., on grokking or modular addition) to contextualize the frequency approach.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns on experimental reporting, the interpretation of the frequency-based extraction method, and the analysis of the task-agnostic partitioning approach. Point-by-point responses follow.


Referee: [§4] §4 (Frequency-based extraction): The claim that near-perfect test accuracy reveals an internal generalization structure formed by the model depends on the frequency method operating purely on weight or activation statistics. The paper notes the task-agnostic subnetwork performs worse; this raises the possibility that frequency selection leverages known sparse Fourier properties of modular arithmetic rather than patterns learned despite noise. A concrete test (e.g., applying the same method to a non-arithmetic task or ablating task-informed frequency priors) is needed to support the interpretation.

Authors: We agree that this is a substantive concern for the interpretation. The frequency method in the paper operates on the spectrum of the model's learned input-output mapping (derived from activations or weights) without injecting explicit task-specific Fourier bases as input; dominant frequencies are selected based on the model's own behavior under noise. That said, the modular arithmetic setting does make low-frequency components naturally salient for the clean function. To address the referee's request for a concrete test, the revision includes an ablation that removes any assumed frequency priors during selection and applies the same extraction procedure to a non-arithmetic task with label noise (a noisy binary classification problem on synthetic data). Results are reported in the updated Section 4 and support that the method recovers a generalizable component even when task Fourier structure is not presupposed. revision: yes
Referee: [Section 3] Experimental details (Section 3): The abstract and results report near-perfect accuracy at 80% noise, but the manuscript lacks explicit reporting of the number of random seeds, standard deviations across runs, exact train/test split ratios, and whether hyperparameter search was performed on the test set. These details are load-bearing for assessing whether the coexistence observation is robust or sensitive to post-hoc choices.

Authors: The referee is correct that these details were insufficiently reported. The revised manuscript adds a new subsection in Section 3 that explicitly states: all main results use 5 independent random seeds with standard deviations shown in tables and error bars on plots; the train/test split is 70/30 with an additional held-out validation set for hyperparameter selection; and no test-set information was used during tuning or model selection. These changes make the robustness of the reported coexistence observations verifiable. revision: yes
Referee: [§5] §5 (Partitioning method): The task-agnostic subnetwork improves generalization but remains limited compared to frequency extraction. If the central claim is that the generalization structure is distributed across neurons, the manuscript should quantify how much of the performance gap is due to the partitioning heuristic versus inherent distribution of the structure (e.g., via neuron ablation or activation patching experiments).

Authors: We accept that additional quantification is needed to separate heuristic limitations from the distributed character of the structure. In the revision we have added neuron-ablation experiments in Section 5: after identifying the generalization subnetwork via the task-agnostic method, we progressively ablate neurons within it and measure the resulting drop in extracted generalization accuracy. The results show a gradual rather than catastrophic degradation, indicating that the performance gap relative to frequency extraction is primarily attributable to the distributed nature of the structure across many neurons rather than to shortcomings of the partitioning heuristic alone. These experiments are now included with quantitative plots. revision: yes
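The progressive ablation described in the last response can be sketched as follows. This is an illustration only: it assumes a bias-free two-layer ReLU network whose hidden pre-activation for input (a, b) is U[a] + V[b], with output weights W, and the names `accuracy` and `ablation_curve` are hypothetical. The order in which neurons are removed is supplied by the caller, for example the ranking produced by a partitioning heuristic.

```python
import numpy as np

def accuracy(U, V, W, pairs, labels, keep):
    """Accuracy of a two-layer ReLU net with a 0/1 mask `keep` over hidden neurons."""
    hidden = np.maximum(U[pairs[:, 0]] + V[pairs[:, 1]], 0.0)  # shape (N, M)
    logits = (hidden * keep) @ W                               # shape (N, P)
    return float((logits.argmax(axis=1) == labels).mean())

def ablation_curve(U, V, W, pairs, labels, order):
    """Zero out neurons one at a time in `order`; return accuracy after each step."""
    keep = np.ones(U.shape[1])
    curve = [accuracy(U, V, W, pairs, labels, keep)]   # full network first
    for m in order:
        keep[m] = 0.0                                  # ablate neuron m
        curve.append(accuracy(U, V, W, pairs, labels, keep))
    return curve
```

A gradual decline along the returned curve would be consistent with a distributed structure, while a sharp drop after a few neurons would point to a localized one.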

Circularity Check

0 steps flagged

No circularity in experimental claims or derivations


The paper is an empirical investigation relying on experiments with two-layer networks on modular arithmetic under label noise. Central claims concern observed behaviors such as faster memorization of noisy labels, better generalization in larger models, and improved test accuracy via frequency-based extraction of internal structure. No derivation chain, equations, or first-principles predictions are presented that reduce to inputs by construction, self-definition, or fitted parameters renamed as outputs. The task-agnostic subnetwork partitioning is introduced and compared directly via measured accuracies, with no load-bearing self-citations or uniqueness theorems invoked to force results. The work remains self-contained against its experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on experimental observations rather than formal axioms or derivations; no free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5705 in / 1115 out tokens · 26981 ms · 2026-05-20T12:24:07.099141+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We perform a Fourier-based decomposition on each hidden neuron... isolate the frequency ω with the maximum magnitude... dominant-frequency sub-network achieves high test accuracy even under severe label noise"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_eq_pow · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "the generalization representation of ReLU models closely matches the analytical solution (2) for modular addition... u_{mi} = λ cos((2π/P) ω_m i + φ_m^{(a)})"
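A short worked identity shows why cosine-shaped weights of the quoted form can compute modular addition. The derivation below is a sketch and is not taken from the paper: it assumes a quadratic activation, assumes v and w share the cosine form of u with phases φ^{(b)} and φ^{(c)}, writes θ_m = 2πω_m/P, and assumes the phase relation φ_m^{(c)} = φ_m^{(a)} + φ_m^{(b)}.

```latex
\begin{align}
(u_{ma} + v_{mb})^2 &= u_{ma}^2 + v_{mb}^2 + 2\,u_{ma}v_{mb}, \\
2\,u_{ma}v_{mb} &= \lambda^2\Big[\cos\!\big(\theta_m(a+b) + \varphi^{(a)}_m + \varphi^{(b)}_m\big)
                 + \cos\!\big(\theta_m(a-b) + \varphi^{(a)}_m - \varphi^{(b)}_m\big)\Big], \\
\lambda\cos\!\big(\theta_m c + \varphi^{(c)}_m\big)\cdot
\lambda^2\cos\!\big(\theta_m(a+b) + \varphi^{(c)}_m\big)
 &= \tfrac{\lambda^3}{2}\Big[\cos\!\big(\theta_m(a+b-c)\big)
  + \cos\!\big(\theta_m(a+b+c) + 2\varphi^{(c)}_m\big)\Big].
\end{align}
```

The term cos(θ_m(a+b−c)) equals 1 for every neuron exactly when c ≡ a+b (mod P), because θ_m(a+b−c) is then an integer multiple of 2π. Summed over neurons it adds constructively at the correct output, while the remaining terms carry neuron-dependent phases and tend to cancel when phases and frequencies are spread across many neurons.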

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 7 internal anchors

[1]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[2]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[3]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[4]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Superposition Yields Robust Neural Scaling

Yizhou Liu, Ziming Liu, and Jeff Gore. Superposition yields robust neural scaling.arXiv preprint arXiv:2505.10465, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Do machine learning models memorize or generalize?, August 2023

Adam Pearce, Asma Ghandeharioun, Nada Hussein, Nithum Thain, Martin Wattenberg, and Lucas Dixon. Do machine learning models memorize or generalize?, August 2023. URL https://pair.withgoogle.com/explorables/grokking/. Accessed: 2024-01-20

work page 2023
[8]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gen- eralization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[10]

The clock and the pizza: Two stories in mechanistic explanation of neural networks.Advances in neural information processing systems, 36:27223–27250, 2023

Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks.Advances in neural information processing systems, 36:27223–27250, 2023

work page 2023
[11]

Feature emergence via margin maximization: case studies in algebraic tasks

Depen Morwani, Benjamin L Edelman, Costin-Andrei Oncescu, Rosie Zhao, and Sham M Kakade. Feature emergence via margin maximization: case studies in algebraic tasks. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[12]

URLhttps://arxiv.org/abs/2509.21519

Yuandong Tian. Provable scaling laws of feature emergence from learning dynamics of grokking. arXiv preprint arXiv:2509.21519, 2025

work page arXiv 2025
[13]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6. 10

work page 2024
[15]

Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

work page 2021
[16]

Memorization without overfitting: Analyzing the training dynamics of large language models.Advances in Neural Information Processing Systems, 35:38274–38290, 2022

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models.Advances in Neural Information Processing Systems, 35:38274–38290, 2022

work page 2022
[17]

Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks

Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. InInternational conference on artificial intelligence and statistics, pages 4313–4324. PMLR, 2020

work page 2020
[18]

arXiv preprint arXiv:2301.02679 , year=

Andrey Gromov. Grokking modular arithmetic.arXiv preprint arXiv:2301.02679, 2023

work page arXiv 2023
[19]

Distinct types of eigenvector localization in networks.Scientific reports, 6(1):18847, 2016

Romualdo Pastor-Satorras and Claudio Castellano. Distinct types of eigenvector localization in networks.Scientific reports, 6(1):18847, 2016

work page 2016
[20]

Cambridge University Press, 2019

Steven M Girvin and Kun Yang.Modern condensed matter physics. Cambridge University Press, 2019

work page 2019
[21]

To grok or not to grok: Disen- tangling generalization and memorization on corrupted algorithmic datasets

Darshil Doshi, Aritra Das, Tianyu He, and Andrey Gromov. To grok or not to grok: Disen- tangling generalization and memorization on corrupted algorithmic datasets. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[22]

URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine- learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. doi: 10.1073/pnas.1903070116. URL https: //www.pnas.org/doi/abs/10.1073/pnas.1903070116

work page doi:10.1073/pnas.1903070116 2019
[23]

Rethinking bias-variance trade-off for generalization of neural networks

Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. InInternational Conference on Machine Learning, pages 10767–10777. PMLR, 2020

work page 2020
[24] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020.
[25] Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.
[26] Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84. PMLR, 2020.
[27] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
[28] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
[29] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics, 50(2):949, 2022.
[30] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020.
[31] Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. arXiv preprint arXiv:2003.01897, 2020.
[32] Ali Hussaini Umar, Franky Kevin Nando Tezoh, Jean Barbier, Santiago Acevedo, and Alessandro Laio. The effect of label noise on the information content of neural representations. arXiv preprint arXiv:2510.06401, 2025.
[33] Matteo Gamba, Erik Englesson, Mårten Björkman, and Hossein Azizpour. Deep double descent via smooth interpolation. arXiv preprint arXiv:2209.10080, 2022.
[34] Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping Yeh-Chiang, Yehuda Dar, Richard Baraniuk, Micah Goldblum, and Tom Goldstein. Can neural nets learn the same model twice? Investigating reproducibility and double descent from the decision boundary perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 136...
[35] Xander Davies, Lauro Langosco, and David Krueger. Unifying grokking and double descent. arXiv preprint arXiv:2303.06173, 2023.
[36] Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, and Maosong Sun. Unified view of grokking, double descent and emergent abilities: A perspective from circuits competition. arXiv preprint arXiv:2402.15175, 2024.
[37] Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022.
[38] Neil Rohit Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, and Mikhail Belkin. Emergence in non-neural models: grokking modular arithmetic via average gradient outer product. In Forty-second International Conference on Machine Learning, 2025.
[39] Gavin McCracken, Gabriela Moisescu-Pareja, Vincent Letourneau, Doina Precup, and Jonathan Love. Uncovering a universal abstract algorithm for modular addition in neural networks. arXiv preprint arXiv:2505.18266, 2025.
[40] Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You. Robust training under label noise by over-parameterization. In International Conference on Machine Learning, pages 14153–14172. PMLR, 2022.
[41] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 34(11):8135–8153, 2022.
[42] Zhiwei Xu, Yutong Wang, Spencer Frei, Gal Vardi, and Wei Hu. Benign overfitting and grokking in ReLU networks for XOR cluster data. arXiv preprint arXiv:2310.02541, 2023.

A Related Work

Model-wise Double Descent and Over-parameterization. The discovery of the double descent phenomenon (illustrated in Figure 1a) marked a significant shift in modern machine l...
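[43] for each neuron $\{u_m, v_m, w_m\}$, there exists a scaling constant $\lambda \in \mathbb{R}$ and a frequency $\omega \in \{1, \dots, \frac{P-1}{2}\}$, such that

$$u_{mi} = \lambda \cos\left(\frac{2\pi}{P}\,\omega_m i + \phi^{(a)}_m\right), \tag{6a}$$

$$v_{mj} = \lambda \cos\left(\frac{2\pi}{P}\,\omega_m j + \phi^{(b)}_m\right), \tag{6b}$$

$$w_{mk} = \lambda \cos\left(\frac{2\pi}{P}\,\omega_m k + \phi^{(c)}_m\right), \tag{6c}$$

for some phase offsets $\phi^{(a)}_m, \phi^{(b)}_m, \phi^{(c)}_m \in \mathbb{R}$ satisfying $\phi^{(a)}_m + \phi^{(b)}_m = \phi^{(c)}_m$.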
[44] For every frequency $\omega \in \{1, \dots, \frac{P-1}{2}\}$, at least one neuron in the network uses this frequency.

C Supplementary Experiments for Section 3

C.1 Double-descent curves on varying setups

Model-wise double descent is clearly observed across different activation functions, training ratios, optimizers, and arithmetic tasks. We summarize the Figures about it in Tab...

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[2] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[3] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.

[4] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[5] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.

[6] Yizhou Liu, Ziming Liu, and Jeff Gore. Superposition yields robust neural scaling. arXiv preprint arXiv:2505.10465, 2025.

[7] Adam Pearce, Asma Ghandeharioun, Nada Hussein, Nithum Thain, Martin Wattenberg, and Lucas Dixon. Do machine learning models memorize or generalize?, August 2023. URL https://pair.withgoogle.com/explorables/grokking/. Accessed: 2024-01-20.

[8] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.

[9] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023.

[10] Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks. Advances in Neural Information Processing Systems, 36:27223–27250, 2023.

[11] Depen Morwani, Benjamin L Edelman, Costin-Andrei Oncescu, Rosie Zhao, and Sham M Kakade. Feature emergence via margin maximization: case studies in algebraic tasks. In The Twelfth International Conference on Learning Representations, 2024.

[12] Yuandong Tian. Provable scaling laws of feature emergence from learning dynamics of grokking. arXiv preprint arXiv:2509.21519, 2025. URL https://arxiv.org/abs/2509.21519

[13] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[14] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon, 6.

[15] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.

[16] Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. Advances in Neural Information Processing Systems, 35:38274–38290, 2022.
