Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning

Albert No; Kyelim Lee; Shin So

arxiv: 2605.14659 · v1 · pith:77KEEUB7new · submitted 2026-05-14 · 💻 cs.LG

Pith reviewed 2026-06-30 21:23 UTC · model grok-4.3

classification 💻 cs.LG

keywords: grokking, generalization, memorization, dataset size, transformers, algorithmic learning, structured output, Needleman-Wunsch

The pith

Small transformers reach high validation accuracy fastest at an intermediate dataset size, not the largest one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Critical-data-size accounts of grokking predict that once training data suffices to identify an underlying rule, additional data should accelerate validation convergence. This paper tests the prediction in a controlled structured-output task using small transformers on Needleman-Wunsch matrix generation. It finds that high validation exact-match accuracy arrives with the fewest gradient updates at an intermediate dataset size. Past this point generalization remains possible but requires more updates, while larger data can accelerate high training accuracy once partial validation competence emerges. A multiplication baseline lacks the same post-threshold slowdown for generalization.

Core claim

In Needleman-Wunsch matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size. Beyond this sweet spot, generalization stays achievable but demands more gradient updates. In the regime where partial validation competence first appears, larger datasets instead require fewer updates to reach high training accuracy. The same post-threshold slowdown for generalization does not occur on a multiplication baseline. These observations separate the critical data size for generalization onset from the dataset size that optimizes update-based convergence and show that learning the rule and completing exact fitting can diverge in structured-output tasks.
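To make the task concrete, here is a minimal sketch of the standard Needleman-Wunsch dynamic-programming recurrence that produces the score matrix a model would have to generate. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the function names, and the reading of "exact match" as every matrix entry being correct are assumptions for illustration; the page does not state the paper's actual settings.

```python
from typing import List


def nw_score_matrix(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> List[List[int]]:
    """Fill the Needleman-Wunsch global-alignment score matrix for a and b.

    The scoring parameters are illustrative assumptions, not the paper's.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each remaining cell depends on its diagonal, upper, and left neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score


def exact_match(pred: List[List[int]], target: List[List[int]]) -> bool:
    """Assumed metric: a prediction counts only if every cell is correct."""
    return pred == target
```

Under this reading a single wrong cell fails the whole example, which is one way that getting the rule mostly right and completing exact fitting could come apart.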

What carries the argument

The dataset-size sweet spot for update-efficient generalization in structured-output algorithmic tasks: the number of gradient updates needed to reach high validation exact-match accuracy is minimized at an intermediate rather than the maximal dataset size.

Load-bearing premise

The Needleman-Wunsch matrix generation task and chosen transformer scale form a representative controlled setting in which post-threshold effects from critical data size should hold without confounding factors.

What would settle it

Finding that validation exact-match accuracy in the Needleman-Wunsch task converges in progressively fewer updates as dataset size grows past the reported intermediate point would falsify the claimed sweet spot.
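The test described above can be sketched as a threshold-crossing computation. The snippet assumes each run is logged as `(update_count, validation_accuracy)` pairs per dataset size; the function names and the toy numbers are hypothetical placeholders, not results from the paper. The threshold τ = 0.98 matches the value named in the Figure 5 caption.

```python
from typing import Dict, List, Optional, Tuple

Curve = List[Tuple[int, float]]  # (optimizer updates, validation accuracy)


def updates_to_threshold(curve: Curve, tau: float) -> Optional[int]:
    """Return the first logged update count with accuracy >= tau, else None."""
    for step, acc in curve:
        if acc >= tau:
            return step
    return None


def sweet_spot(curves: Dict[int, Curve], tau: float) -> Optional[int]:
    """Return the dataset size that crosses tau in the fewest updates."""
    crossings = {n: updates_to_threshold(c, tau) for n, c in curves.items()}
    reached = {n: s for n, s in crossings.items() if s is not None}
    if not reached:
        return None
    return min(reached, key=reached.get)


# Toy curves invented for illustration only (NOT data from the paper).
toy = {
    1000: [(100, 0.10), (200, 0.30), (300, 0.50)],   # never reaches tau
    3000: [(100, 0.40), (200, 0.99)],                # crosses at 200 updates
    10000: [(100, 0.20), (200, 0.60), (300, 0.99)],  # crosses at 300 updates
}
best = sweet_spot(toy, tau=0.98)
is_interior = best is not None and best != max(toy)
```

In this toy case `best` is the middle dataset size, so `is_interior` is true. The claim would be contradicted if real curves gave crossing times that keep shrinking as the dataset grows, which would put the minimum at the largest size.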

Figures

Figures reproduced from arXiv: 2605.14659 by Albert No, Kyelim Lee, Shin So.

**Figure 2.** NW multi-threshold sweeps across depths

**Figure 3.** Random-suffix probe. Each NW target is augmented with a four-bit random suffix. We …

**Figure 4.** Train–random gap diagnostic. Orange curves show NW validation accuracy …

**Figure 5.** Task-scope controls at τ = 0.98. NW, but not addition or multiplication, shows an interior validation optimum

**Figure 6.** Seed variability for the L = 5 NW multi-threshold sweeps. Curves show the same depth sweep as …

**Figure 7.** Epoch-normalized convergence for the L = 5 NW depth sweep. The same threshold crossings are measured in epochs rather than optimizer updates. Epoch normalization emphasizes data exposure and recovers the more conventional view in which larger datasets require fewer passes over the data. This confirms that the sweet spot in the main text is specifically an update-based convergence phenomenon.

**Figure 8.** Seed variability for the L = 4 NW sweeps. Shaded regions denote one standard deviation across random seeds. Exact threshold-crossing times vary, but the intermediate-data regime again shows overlap between weak validation competence and faster training-threshold crossings. [Plot: epochs on the x-axis; panel "Depth 3"; accuracy thresholds 0.1, 0.2, 0.3, 0.5, 0.9, 0.98; validation (IV) and training (IT) curves.]

**Figure 9.** Epoch-normalized convergence for the L = 4 NW task. We report epochs to threshold for the same L = 4 configurations as …

**Figure 10.** Full random-suffix component trajectories. Each NW target is augmented with a four-bit …

**Figure 11.** Full dataset-size sweep for the train–random gap. For each …
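The Figure 7 caption contrasts counting optimizer updates with counting epochs. A minimal sketch of the conversion follows, assuming a fixed batch size and one batch consumed per update; the function name, the batch size, and the toy crossing times are assumptions for illustration, not values from the paper.

```python
def updates_to_epochs(updates: int, batch_size: int, dataset_size: int) -> float:
    """Convert optimizer updates into passes over the data.

    Assumes one batch of `batch_size` examples per update (an assumption).
    """
    return updates * batch_size / dataset_size


# Hypothetical crossing times (NOT from the paper): the larger dataset
# needs more updates to cross the threshold but fewer epochs.
BATCH = 100  # assumed batch size
epochs_mid = updates_to_epochs(200, BATCH, 3000)     # about 6.67 epochs
epochs_large = updates_to_epochs(300, BATCH, 10000)  # 3.0 epochs
```

Because the epoch count divides by dataset size, a run can need more updates and still need fewer epochs. That is consistent with the caption's remark that epoch normalization recovers the conventional picture while the sweet spot appears only under update counting.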

Original abstract

Critical-data-size accounts of grokking suggest a natural post-threshold intuition: once training data is sufficient to identify the underlying rule, additional data should accelerate validation convergence. We show that this intuition can fail in a controlled structured-output task. In Needleman-Wunsch (NW) matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, not at the largest one. Past this dataset-size sweet spot, generalization remains achievable but requires more gradient updates. Conversely, in the regime where partial validation competence first appears, larger datasets can require fewer updates to reach high training accuracy, suggesting that emerging rule structure can accelerate fitting beyond example-wise memorization. A multiplication baseline does not show the same post-threshold slowdown. These results separate the critical data size for the onset of generalization from the dataset size that optimizes update-based convergence, and identify structured-output tasks where learning the rule and completing exact-fitting can diverge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports an empirical sweet spot in dataset size where small transformers reach validation accuracy fastest in the NW task, separating it from the critical size for generalization onset.

The paper's main observation is that in Needleman-Wunsch matrix generation, small Transformers reach high validation exact-match accuracy with the fewest updates at an intermediate dataset size rather than the largest. Past that point generalization still happens but takes more updates. This runs counter to the expectation from critical-data-size accounts that extra data past the threshold should speed convergence. The authors also note that, in the regime where partial validation competence first appears, larger datasets can need fewer updates to reach high training accuracy, and that a multiplication baseline does not show the same slowdown.

What is new is the explicit separation between the data size needed for generalization to begin and the size that minimizes update count to convergence. The work gives a controlled example where rule learning and exact fitting diverge in their data dependence.

The results are presented as an empirical report on one task family with small models. The abstract alone supplies no error bars, seed counts, hyperparameter details, or ablations, so it is not possible to judge how robust the sweet spot is or whether the NW setup introduces special confounds. The multiplication comparison is a start but remains narrow.

This is for researchers working on grokking and data scaling in algorithmic tasks. A reader already thinking about when more data helps versus hurts learning speed would find the distinction worth checking.

The paper deserves peer review to see whether the full experiments hold up under scrutiny. I would send it to referees.

Referee Report

1 major / 0 minor

Summary. The paper claims that in the Needleman-Wunsch matrix generation task, small Transformers achieve high validation exact-match accuracy with the fewest gradient updates at an intermediate dataset size rather than the largest one, contrary to post-threshold expectations from critical-data-size accounts of grokking. It further claims that larger datasets can accelerate training accuracy (memorization) once partial validation competence emerges, while a multiplication baseline does not exhibit the same post-threshold slowdown in generalization.

Significance. If the empirical observation holds under controlled conditions, the result would separate the critical data size required for the onset of generalization from the dataset size that minimizes the number of updates needed for convergence. This distinction could refine understanding of when rule structure accelerates fitting versus when additional data impedes update-efficient generalization in structured-output algorithmic tasks.

major comments (1)

The provided abstract and reader's assessment indicate that experimental details (dataset size ranges, NW task generation procedure, Transformer hyperparameters, number of runs, and error bars) are not available for verification; without these, the robustness of the reported sweet spot cannot be assessed and the central empirical claim remains unverified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their assessment and the opportunity to clarify the experimental details supporting our central claims. We address the major comment below.

Referee: The provided abstract and reader's assessment indicate that experimental details (dataset size ranges, NW task generation procedure, Transformer hyperparameters, number of runs, and error bars) are not available for verification; without these, the robustness of the reported sweet spot cannot be assessed and the central empirical claim remains unverified.

Authors: The referee correctly notes that the abstract omits these specifics, which is standard for abstracts. The full manuscript contains a dedicated Experimental Setup section (Section 3) and Appendix that specify: dataset sizes spanning multiple orders of magnitude with the intermediate sweet spot identified; the NW matrix generation procedure via the standard dynamic-programming recurrence on random input sequences; the small Transformer architecture and training hyperparameters; the number of independent runs; and error bars on all reported curves. These elements are also referenced in the figure captions and results section. We believe this information suffices for verification and reproduction of the reported sweet spot and the contrast with the multiplication baseline. If any aspect remains unclear, we are happy to expand or clarify further in revision.

Revision: no

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical report on dataset-size effects in Transformer training for a structured-output task. The abstract describes observational results on validation accuracy convergence rates at different data scales, with no equations, derivations, fitted parameters presented as predictions, or self-citations invoked as load-bearing uniqueness theorems. No steps reduce by construction to inputs; the central claim is a direct experimental finding on a specific algorithmic task and baseline comparison, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical investigation; no free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 4 internal anchors

[1] Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.

[2] Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

[3] Unified view of grokking, double descent and emergent abilities: A perspective from circuits competition

Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, and Maosong Sun. Unified view of grokking, double descent and emergent abilities: A perspective from circuits competition. arXiv preprint arXiv:2402.15175.

[4] Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[5] Teaching arithmetic to small transformers

Nayoung Lee, Kartik Sreenivasan, Jason D Lee, Kangwook Lee, and Dimitris Papailiopoulos. Teaching arithmetic to small transformers. arXiv preprint arXiv:2307.03381, 2023.

[6] Dichotomy of early and late phase implicit biases can provably induce grokking

Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S Du, Jason D Lee, and Wei Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. arXiv preprint arXiv:2311.18817.

[7] How much do language models memorize?

John X Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G Edward Suh, Alexander M Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize? arXiv preprint arXiv:2505.24832, 2025.

[8] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.

[9] Explaining grokking through circuit efficiency

Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.

[10] Critical data size of language models from a grokking perspective

Xuekai Zhu, Yao Fu, Bowen Zhou, and Zhouhan Lin. Critical data size of language models from a grokking perspective. arXiv preprint arXiv:2401.10463.
