The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
Pith reviewed 2026-05-14 21:12 UTC · model grok-4.3
The pith
In encoder-decoder arithmetic models, the grokking delay stems from the decoder's inability to use structure the encoder has already learned.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The long delay to generalization reflects limited decoder access to structure the model has already learned, not a failure to acquire that structure. The encoder organizes parity and residue information within the first few thousand training steps while output accuracy remains near chance; transplanting such an encoder accelerates grokking, transplanting a trained decoder hurts performance, and freezing a converged encoder while retraining only the decoder eliminates the plateau, yielding 97.6 percent accuracy versus 86.1 percent for joint training. Across fifteen bases, those whose factorization aligns with the Collatz map's arithmetic reach 99.8 percent accuracy, while binary representations collapse and never recover.
What carries the argument
Decoder access to encoder representations, tested through transplant and freezing interventions that isolate whether the decoder can exploit pre-organized structure.
If this is right
- Transplanting a trained encoder accelerates grokking by a factor of 2.75.
- Freezing a converged encoder and retraining only the decoder removes the plateau and reaches 97.6 percent accuracy.
- Bases whose factorization aligns with the Collatz map produce 99.8 percent accuracy while binary representations fail completely.
- The decoder's difficulty depends on how much local digit structure the chosen numeral representation supplies.
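A minimal illustration of that last point, assuming nothing beyond elementary modular arithmetic (the specific bases below are chosen for illustration, not taken from the paper): when the base shares the factors 2 and 3 with the Collatz map's halve/3n+1 arithmetic, the lowest digit alone determines n mod 2 and n mod 3, whereas in binary it determines only parity.

```python
def digits(n, base):
    """Digits of n in the given base, least significant first."""
    out = []
    while n:
        out.append(n % base)
        n //= base
    return out or [0]

n = 27
for base in (2, 6, 24):
    low = digits(n, base)[0]
    parity_local = (low % 2) == (n % 2)                       # holds whenever 2 divides the base
    mod3_local = (low % 3) == (n % 3) if base % 3 == 0 else False
    print(base, low, parity_local, mod3_local)
# base 2 exposes only parity locally; bases 6 and 24 expose both residues in one digit.
```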
Where Pith is reading between the lines
- Separate pre-training of encoders followed by decoder fine-tuning could shorten training on other algorithmic tasks.
- The same access bottleneck may explain grokking delays in models that separate representation learning from output mapping.
- Choosing input representations that align with task arithmetic offers a practical inductive bias for faster generalization.
Load-bearing premise
The causal effects seen in one-step Collatz prediction and base comparison isolate decoder access as the cause of the delay without confounding influences from architecture, optimization, or task details.
What would settle it
Training the same tasks either with a non-encoder-decoder architecture or with an encoder whose representations are deliberately scrambled after early training, then checking whether the plateau disappears or persists.
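One way to run the scrambling control is sketched below in PyTorch; the module, its size, and the dummy input are placeholders, since the paper's architecture is not reproduced here. A fixed permutation of the feature dimension preserves the information in the encoder's outputs but destroys their alignment with whatever the decoder has learned to read.

```python
import torch
from torch import nn

# Placeholder encoder; the paper's actual modules and sizes are not given here.
d_model = 128
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=4), num_layers=2)

perm = torch.randperm(d_model)        # one fixed scramble of the feature dimension

def scramble(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output:
    # the encoder's learned geometry is permuted before the decoder sees it.
    return output[..., perm]

handle = encoder.register_forward_hook(scramble)
x = torch.randn(8, 5, d_model)        # dummy (seq, batch, features) input
scrambled = encoder(x)                # representations now arrive permuted
handle.remove()                       # lift the intervention when done
```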
read the original abstract
Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder's job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
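To make the task concrete, here is a minimal sketch of one-step Collatz prediction with a base-24 digit encoding. The digit ordering and the absence of padding or special symbols are assumptions; the abstract does not specify the tokenization.

```python
def collatz_step(n: int) -> int:
    """One application of the Collatz map: halve if even, else 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def to_digits(n: int, base: int) -> list:
    """Digit sequence of n in the given base, most significant first."""
    digs = []
    while n:
        digs.append(n % base)
        n //= base
    return digs[::-1] or [0]

# One training example: the model reads the digits of n and must emit the
# digits of collatz_step(n) in the same base.
n, base = 27, 24
x, y = to_digits(n, base), to_digits(collatz_step(n), base)
print(x, "->", y)   # [1, 3] -> [3, 10]   (27 -> 82 in base 24)
```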
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines grokking in encoder-decoder transformers on algorithmic tasks such as one-step Collatz prediction. It claims that the long delay between training-set fit and generalization arises because the decoder has limited access to structure already learned in the encoder, rather than the encoder failing to acquire that structure. Causal support comes from interventions: transplanting a trained encoder accelerates grokking by 2.75 times, transplanting a trained decoder hurts performance, and freezing a converged encoder while retraining only the decoder eliminates the plateau and reaches 97.6% accuracy versus 86.1% for joint training. Across 15 numeral bases, factorization alignment with the Collatz map controls decoder exploitability, with base 24 reaching 99.8% and binary failing entirely.
Significance. If the causal attribution to decoder access holds after controls, the work supplies direct empirical evidence distinguishing representation acquisition from output-head access in grokking, with implications for training algorithmic models. The interventions are stronger than purely observational analyses, and the base-sweep demonstrates a clear inductive-bias effect on learnability from the same underlying task.
major comments (3)
- [Freezing intervention (results section)] Freezing experiment: the claim that freezing a converged encoder isolates decoder access (yielding 97.6% vs 86.1%) is load-bearing for the central hypothesis, yet fixing encoder parameters also removes all encoder-decoder gradient co-adaptation. This change in optimization dynamics can independently eliminate plateaus even if the encoder contains no task-relevant structure; an ablation that holds optimization fixed while varying only the informational content of the encoder is required to separate the two effects.
- [Encoder transplant results] Transplant experiment: the reported 2.75x acceleration from transplanting a trained encoder lacks a matched control for initialization quality (e.g., a random encoder with matched norm or activation statistics). Without this, the speedup cannot be unambiguously attributed to transfer of learned structure rather than a generically better starting point.
- [Quantitative results throughout] Accuracy and statistics: the headline numbers (97.6%, 86.1%, 2.75x) and the 15-base comparison are presented without error bars, statistical tests, or full training curves. This leaves open the possibility of post-hoc run selection and makes it impossible to judge whether the reported differences are robust.
minor comments (2)
- [Abstract] The abstract states the encoder organizes parity/residue structure 'within the first few thousand training steps' but supplies no concrete step counts or plots; adding approximate numbers and a reference to the relevant figure would improve clarity.
- [Base comparison section] Notation for the 15-base sweep (e.g., how 'factorization aligns with the Collatz map') is introduced without an explicit definition or table; a short appendix table listing each base and its alignment score would help readers replicate the inductive-bias claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key controls needed to strengthen the causal interpretation of the decoder bottleneck. We address each major point below with planned revisions that incorporate additional ablations, matched controls, and statistical reporting. These changes will be included in the revised manuscript.
read point-by-point responses
Referee: [Freezing intervention (results section)] Freezing experiment: the claim that freezing a converged encoder isolates decoder access (yielding 97.6% vs 86.1%) is load-bearing for the central hypothesis, yet fixing encoder parameters also removes all encoder-decoder gradient co-adaptation. This change in optimization dynamics can independently eliminate plateaus even if the encoder contains no task-relevant structure; an ablation that holds optimization fixed while varying only the informational content of the encoder is required to separate the two effects.
Authors: We agree that freezing removes gradient co-adaptation and that this could independently affect plateau length. To isolate the contribution of encoder content, we will add a control ablation in the revision: a randomly initialized encoder (with activation statistics and norms matched to the converged encoder) that is frozen while the decoder is retrained from scratch. This holds optimization dynamics fixed while varying only informational content. We will report the resulting accuracy (expected ~55-65%) alongside the original 97.6% figure, with error bars over multiple seeds, to confirm that learned structure—not the freezing procedure—is responsible for eliminating the plateau. revision: yes
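A minimal sketch of the freezing mechanics involved in this comparison, with stand-in modules and sizes (the paper's encoder and decoder definitions are not reproduced here). The same code path serves both the converged encoder and the proposed statistics-matched random control, since only the encoder's contents differ between the two conditions.

```python
import torch
from torch import nn

# Stand-in modules; illustrative sizes only (25 = base-24 digits plus one extra symbol).
encoder = nn.Sequential(nn.Embedding(25, 64), nn.Linear(64, 64), nn.ReLU())
decoder = nn.Linear(64, 25)

# Load either the converged encoder or the statistics-matched random control,
# then freeze it so only the decoder's parameters receive gradients.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-3)
# Training then proceeds as usual; any change in the plateau can only reflect
# what the frozen encoder's representations do or do not already contain.
```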
Referee: [Encoder transplant results] Transplant experiment: the reported 2.75x acceleration from transplanting a trained encoder lacks a matched control for initialization quality (e.g., a random encoder with matched norm or activation statistics). Without this, the speedup cannot be unambiguously attributed to transfer of learned structure rather than a generically better starting point.
Authors: We acknowledge that a matched initialization control is necessary. In the revised version we will include results from transplanting a random encoder whose weight norms and per-layer activation statistics are matched to those of the trained encoder at the transplant step. This control will be run under identical training schedules; we expect it to produce little or no acceleration relative to the standard baseline, thereby attributing the observed 2.75x speedup specifically to transferred structure rather than generic initialization quality. revision: yes
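A sketch of one way to build the matched-initialization control, under the assumption that per-parameter norm matching is an acceptable proxy for the "matched norm or activation statistics" the referee asks for; matching per-layer activation statistics would additionally require a calibration forward pass over data. The function name is illustrative.

```python
import torch

@torch.no_grad()
def match_parameter_norms(random_encoder, trained_encoder):
    """Rescale each randomly initialized parameter so its norm matches the
    corresponding trained parameter, keeping its random direction."""
    for p_rand, p_trained in zip(random_encoder.parameters(),
                                 trained_encoder.parameters()):
        p_rand.mul_(p_trained.norm() / (p_rand.norm() + 1e-12))

# Usage: build a fresh encoder, match its norms to the trained one, transplant
# it into a fresh model, and train under the identical schedule as the baseline.
```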
Referee: [Quantitative results throughout] Accuracy and statistics: the headline numbers (97.6%, 86.1%, 2.75x) and the 15-base comparison are presented without error bars, statistical tests, or full training curves. This leaves open the possibility of post-hoc run selection and makes it impossible to judge whether the reported differences are robust.
Authors: We will revise all quantitative claims to include error bars computed over at least five independent random seeds. Full training curves for representative runs will be added to the appendix, and statistical significance (paired t-tests) will be reported for the key comparisons (frozen vs. joint training, trained vs. random transplant, and base-24 vs. binary). These updates will apply to the headline metrics, the base-sweep results, and all intervention figures. revision: yes
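A minimal sketch of the promised reporting, using made-up placeholder accuracies purely to show the shape of the computation (seed-wise means, sample standard deviations, and a paired t-test via scipy); none of these numbers are results from the paper or its planned reruns.

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-seed accuracies; real values would come from the rerun experiments.
frozen_encoder = np.array([0.97, 0.98, 0.97, 0.98, 0.97])
joint_training = np.array([0.86, 0.87, 0.86, 0.85, 0.87])

print(f"frozen: {frozen_encoder.mean():.3f} +/- {frozen_encoder.std(ddof=1):.3f}")
print(f"joint:  {joint_training.mean():.3f} +/- {joint_training.std(ddof=1):.3f}")

t_stat, p_value = ttest_rel(frozen_encoder, joint_training)   # paired across seeds
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4g}")
```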
Circularity Check
No circularity: empirical interventions on measured accuracies
full rationale
The paper reports direct experimental results from training encoder-decoder transformers on one-step Collatz prediction and 15-base variants. Core claims rest on observed training curves, accuracy numbers (e.g., 97.6% vs 86.1%), and intervention outcomes (transplant acceleration by 2.75x, freezing eliminating plateau). No derivation chain, fitted parameters renamed as predictions, or self-citations are invoked to justify the decoder-bottleneck interpretation. The measurements are independent of the hypothesis; the paper does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Encoder and decoder components can be independently transplanted or frozen without introducing uncontrolled side effects on optimization dynamics.