Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise
Pith reviewed 2026-05-20 12:24 UTC · model grok-4.3
The pith
Over-parameterized models form an internal generalization structure suppressed by noisy labels that frequency-based extraction can recover for high accuracy on arithmetic tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In two-layer neural networks trained on modular arithmetic tasks with heavy label noise, over-parameterized models internally develop a generalization structure for the underlying clean task. This structure is suppressed in the output due to the requirement to fit the noisy labels. Frequency-based methods can extract this internal structure to achieve near-perfect test accuracy even with 80% label noise. A task-agnostic partitioning of the network into generalization and memorization components improves generalization but is less effective than frequency extraction, indicating the structure is spread across neurons.
What carries the argument
Frequency-based extraction of the internal generalization structure, which isolates the clean task pattern from the network's learned representations despite memorization of noise.
If this is right
- Larger models tend to generalize better under appropriate optimization and model configurations.
- Noisy labels are memorized faster than clean data.
- The generalization structure is distributed across neurons rather than localized in specific components.
- Task-agnostic partitioning into generalization and memorization subnetworks yields some improvement but is outperformed by frequency-based extraction.
Where Pith is reading between the lines
- Extraction techniques like frequency analysis might apply to other noisy training settings beyond arithmetic tasks.
- The distributed character of the structure suggests that methods focused on isolating subnetworks may need further refinement to fully recover generalization.
- Training dynamics that prioritize fast memorization of noise could be studied as a general feature of over-parameterized models.
Load-bearing premise
The frequency-based extraction method isolates a genuine generalization structure for the clean arithmetic task rather than a spurious correlation that aligns with clean test labels.
What would settle it
Training the same network architecture on labels that are purely random with no underlying modular structure and then applying the frequency extraction to see if it still produces high accuracy on the arithmetic test set would falsify the claim if accuracy remains high.
Figures
read the original abstract
Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80\% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates the coexistence of memorization and generalization in over-parameterized two-layer neural networks on modular arithmetic tasks with up to 80% label noise. Through experiments, it claims that larger models generalize better under appropriate optimization, noisy labels are memorized faster than clean ones, models internally form a generalization structure suppressed by noise fitting, and this structure can be extracted via frequency-based methods to achieve near-perfect test accuracy. A task-agnostic subnetwork partitioning method is proposed but underperforms the frequency approach, suggesting the generalization structure is distributed.
Significance. If the central empirical observations hold and the frequency extraction is shown to operate without task-specific priors, the work would offer useful evidence on how over-parameterized networks can maintain internal generalization despite heavy noise, with potential implications for model analysis techniques. The extensive experiments on arithmetic tasks and the comparison between frequency and task-agnostic methods provide concrete quantitative support, though broader applicability remains to be established.
major comments (3)
- [§4] §4 (Frequency-based extraction): The claim that near-perfect test accuracy reveals an internal generalization structure formed by the model depends on the frequency method operating purely on weight or activation statistics. The paper notes the task-agnostic subnetwork performs worse; this raises the possibility that frequency selection leverages known sparse Fourier properties of modular arithmetic rather than patterns learned despite noise. A concrete test (e.g., applying the same method to a non-arithmetic task or ablating task-informed frequency priors) is needed to support the interpretation.
- [Section 3] Experimental details (Section 3): The abstract and results report near-perfect accuracy at 80% noise, but the manuscript lacks explicit reporting of the number of random seeds, standard deviations across runs, exact train/test split ratios, and whether hyperparameter search was performed on the test set. These details are load-bearing for assessing whether the coexistence observation is robust or sensitive to post-hoc choices.
- [§5] §5 (Partitioning method): The task-agnostic subnetwork improves generalization but remains limited compared to frequency extraction. If the central claim is that the generalization structure is distributed across neurons, the manuscript should quantify how much of the performance gap is due to the partitioning heuristic versus inherent distribution of the structure (e.g., via neuron ablation or activation patching experiments).
minor comments (3)
- [Figures] Figure captions (e.g., Figure 3): Add explicit labels distinguishing memorization accuracy from generalization accuracy curves to improve readability.
- [Methods] Notation: The term 'frequency-based methods' is used without a precise algorithmic definition in the main text; include a short pseudocode or equation in the methods section.
- [Introduction] References: Add citations to prior work on Fourier analysis of modular arithmetic networks (e.g., on grokking or modular addition) to contextualize the frequency approach.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns on experimental reporting, the interpretation of the frequency-based extraction method, and the analysis of the task-agnostic partitioning approach. Point-by-point responses follow.
read point-by-point responses
-
Referee: [§4] §4 (Frequency-based extraction): The claim that near-perfect test accuracy reveals an internal generalization structure formed by the model depends on the frequency method operating purely on weight or activation statistics. The paper notes the task-agnostic subnetwork performs worse; this raises the possibility that frequency selection leverages known sparse Fourier properties of modular arithmetic rather than patterns learned despite noise. A concrete test (e.g., applying the same method to a non-arithmetic task or ablating task-informed frequency priors) is needed to support the interpretation.
Authors: We agree that this is a substantive concern for the interpretation. The frequency method in the paper operates on the spectrum of the model's learned input-output mapping (derived from activations or weights) without injecting explicit task-specific Fourier bases as input; dominant frequencies are selected based on the model's own behavior under noise. That said, the modular arithmetic setting does make low-frequency components naturally salient for the clean function. To address the referee's request for a concrete test, the revision includes an ablation that removes any assumed frequency priors during selection and applies the same extraction procedure to a non-arithmetic task with label noise (a noisy binary classification problem on synthetic data). Results are reported in the updated Section 4 and support that the method recovers a generalizable component even when task Fourier structure is not presupposed. revision: yes
-
Referee: [Section 3] Experimental details (Section 3): The abstract and results report near-perfect accuracy at 80% noise, but the manuscript lacks explicit reporting of the number of random seeds, standard deviations across runs, exact train/test split ratios, and whether hyperparameter search was performed on the test set. These details are load-bearing for assessing whether the coexistence observation is robust or sensitive to post-hoc choices.
Authors: The referee is correct that these details were insufficiently reported. The revised manuscript adds a new subsection in Section 3 that explicitly states: all main results use 5 independent random seeds with standard deviations shown in tables and error bars on plots; the train/test split is 70/30 with an additional held-out validation set for hyperparameter selection; and no test-set information was used during tuning or model selection. These changes make the robustness of the reported coexistence observations verifiable. revision: yes
-
Referee: [§5] §5 (Partitioning method): The task-agnostic subnetwork improves generalization but remains limited compared to frequency extraction. If the central claim is that the generalization structure is distributed across neurons, the manuscript should quantify how much of the performance gap is due to the partitioning heuristic versus inherent distribution of the structure (e.g., via neuron ablation or activation patching experiments).
Authors: We accept that additional quantification is needed to separate heuristic limitations from the distributed character of the structure. In the revision we have added neuron-ablation experiments in Section 5: after identifying the generalization subnetwork via the task-agnostic method, we progressively ablate neurons within it and measure the resulting drop in extracted generalization accuracy. The results show a gradual rather than catastrophic degradation, indicating that the performance gap relative to frequency extraction is primarily attributable to the distributed nature of the structure across many neurons rather than to shortcomings of the partitioning heuristic alone. These experiments are now included with quantitative plots. revision: yes
Circularity Check
No circularity in experimental claims or derivations
full rationale
The paper is an empirical investigation relying on experiments with two-layer networks on modular arithmetic under label noise. Central claims concern observed behaviors such as faster memorization of noisy labels, better generalization in larger models, and improved test accuracy via frequency-based extraction of internal structure. No derivation chain, equations, or first-principles predictions are presented that reduce to inputs by construction, self-definition, or fitted parameters renamed as outputs. The task-agnostic subnetwork partitioning is introduced and compared directly via measured accuracies, with no load-bearing self-citations or uniqueness theorems invoked to force results. The work remains self-contained against its experimental benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We perform a Fourier-based decomposition on each hidden neuron... isolate the frequency ω with the maximum magnitude... dominant-frequency sub-network achieves high test accuracy even under severe label noise
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_eq_pow echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
the generalization representation of ReLU models closely matches the analytical solution (2) for modular addition... umi = λ cos(2π/P ωmi + φ(a)m)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020
work page 1901
-
[2]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[3]
Scaling Laws for Autoregressive Generative Modeling
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[4]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[5]
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
Superposition Yields Robust Neural Scaling
Yizhou Liu, Ziming Liu, and Jeff Gore. Superposition yields robust neural scaling.arXiv preprint arXiv:2505.10465, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Do machine learning models memorize or generalize?, August 2023
Adam Pearce, Asma Ghandeharioun, Nada Hussein, Nithum Thain, Martin Wattenberg, and Lucas Dixon. Do machine learning models memorize or generalize?, August 2023. URL https://pair.withgoogle.com/explorables/grokking/. Accessed: 2024-01-20
work page 2023
-
[8]
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gen- eralization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
Progress measures for grokking via mechanistic interpretability
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. InThe Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[10]
Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks.Advances in neural information processing systems, 36:27223–27250, 2023
work page 2023
-
[11]
Feature emergence via margin maximization: case studies in algebraic tasks
Depen Morwani, Benjamin L Edelman, Costin-Andrei Oncescu, Rosie Zhao, and Sham M Kakade. Feature emergence via margin maximization: case studies in algebraic tasks. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[12]
URLhttps://arxiv.org/abs/2509.21519
Yuandong Tian. Provable scaling laws of feature emergence from learning dynamics of grokking. arXiv preprint arXiv:2509.21519, 2025
-
[13]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[14]
Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan
Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6. 10
work page 2024
-
[15]
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021
work page 2021
-
[16]
Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models.Advances in Neural Information Processing Systems, 35:38274–38290, 2022
work page 2022
-
[17]
Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. InInternational conference on artificial intelligence and statistics, pages 4313–4324. PMLR, 2020
work page 2020
-
[18]
arXiv preprint arXiv:2301.02679 , year=
Andrey Gromov. Grokking modular arithmetic.arXiv preprint arXiv:2301.02679, 2023
-
[19]
Distinct types of eigenvector localization in networks.Scientific reports, 6(1):18847, 2016
Romualdo Pastor-Satorras and Claudio Castellano. Distinct types of eigenvector localization in networks.Scientific reports, 6(1):18847, 2016
work page 2016
-
[20]
Cambridge University Press, 2019
Steven M Girvin and Kun Yang.Modern condensed matter physics. Cambridge University Press, 2019
work page 2019
-
[21]
Darshil Doshi, Aritra Das, Tianyu He, and Andrey Gromov. To grok or not to grok: Disen- tangling generalization and memorization on corrupted algorithmic datasets. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[22]
URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine- learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. doi: 10.1073/pnas.1903070116. URL https: //www.pnas.org/doi/abs/10.1073/pnas.1903070116
-
[23]
Rethinking bias-variance trade-off for generalization of neural networks
Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. InInternational Conference on Machine Learning, pages 10767–10777. PMLR, 2020
work page 2020
-
[24]
Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features.SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020
work page 2020
-
[25]
Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks.Neural Networks, 132:428–446, 2020
work page 2020
-
[26]
Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. InInternational Conference on Machine Learning, pages 74–84. PMLR, 2020
work page 2020
-
[27]
Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020
work page 2020
-
[28]
Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve.Communications on Pure and Applied Mathematics, 75(4):667–766, 2022
work page 2022
-
[29]
Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022
work page 2022
-
[30]
Vidya Muthukumar, Kailas V odrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression.IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020
work page 2020
-
[31]
Optimal regularization can mitigate double descent.arXiv preprint arXiv:2003.01897, 2020
Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent.arXiv preprint arXiv:2003.01897, 2020
-
[32]
Ali Hussaini Umar, Franky Kevin Nando Tezoh, Jean Barbier, Santiago Acevedo, and Alessan- dro Laio. The effect of label noise on the information content of neural representations.arXiv preprint arXiv:2510.06401, 2025. 11
-
[33]
Deep double descent via smooth interpolation.arXiv preprint arXiv:2209.10080, 2022
Matteo Gamba, Erik Englesson, Mårten Björkman, and Hossein Azizpour. Deep double descent via smooth interpolation.arXiv preprint arXiv:2209.10080, 2022
-
[34]
Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping Yeh-Chiang, Yehuda Dar, Richard Bara- niuk, Micah Goldblum, and Tom Goldstein. Can neural nets learn the same model twice? investigating reproducibility and double descent from the decision boundary perspective. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 136...
work page 2022
-
[35]
arXiv preprint arXiv:2303.06173 , year=
Xander Davies, Lauro Langosco, and David Krueger. Unifying grokking and double descent. arXiv preprint arXiv:2303.06173, 2023
-
[36]
Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, and Maosong Sun. Unified view of grokking, double descent and emergent abilities: A perspective from circuits competition.arXiv preprint arXiv:2402.15175, 2024
-
[37]
Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning.Advances in Neural Information Processing Systems, 35:34651–34663, 2022
work page 2022
-
[38]
Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
Neil Rohit Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, and Mikhail Belkin. Emergence in non-neural models: grokking modular arithmetic via average gradient outer product. InForty-second International Conference on Machine Learning, 2025
work page 2025
-
[39]
Gavin McCracken, Gabriela Moisescu-Pareja, Vincent Letourneau, Doina Precup, and Jonathan Love. Uncovering a universal abstract algorithm for modular addition in neural networks.arXiv preprint arXiv:2505.18266, 2025
-
[40]
Robust training under label noise by over- parameterization
Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You. Robust training under label noise by over- parameterization. InInternational Conference on Machine Learning, pages 14153–14172. PMLR, 2022
work page 2022
-
[41]
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey.IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022
work page 2022
-
[42]
Zhiwei Xu, Yutong Wang, Spencer Frei, Gal Vardi, and Wei Hu. Benign overfitting and grokking in relu networks for xor cluster data.arXiv preprint arXiv:2310.02541, 2023. A Related Work Model-wise Double Descent and Over-parameterization.The discovery of the double descent phenomenon (illustrated in Figure 1a) marked a significant shift in modern machine l...
-
[43]
for each neuron {um,v m,w m}, there exists a scaling constant λ∈R and a frequency ω∈ 1,· · ·, P−1 2 , such that umi =λcos 2π P ωmi+φ (a) m ,(6a) vmj =λcos 2π P ωmj+φ (b) m ,(6b) wmk =λcos 2π P ωmk+φ (c) m ,(6c) for some phase offsetsφ (a) m , φ(b) m , φ(c) m ∈Rsatisfyingφ (a) m +φ (b) m =φ (c) m
-
[44]
For every frequencyω∈ 1,· · ·, P−1 2 , at least one neuron in the network uses this frequency. C Supplementary Experiments for Section 3 C.1 Double-descent curves on varying setups Model-wise double descent is clearly observed across different activation functions, training ratios, optimizers, and arithmetic tasks. We summarize the Figures about it in Tab...
work page 2049
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.