pith. machine review for the scientific record.

arxiv: 2605.04396 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Recognition: 3 Lean theorem links

Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transformers, weight decay, reasoning, memorization, compositional generalization, training dynamics, out-of-distribution accuracy, complexity control

The pith

Transformers decide whether to reason or memorize during one narrow window of training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the choice between low-complexity reasoning and high-complexity memorization in Transformers is settled during a brief, identifiable period rather than across the entire training run. On a controlled compositional task, applying weight decay for only one-quarter of training produces out-of-distribution accuracy nearly identical to applying it throughout. Placing the same total regularization budget in the middle of training yields substantially better generalization than placing it early, and the window boundary is so sharp that shifting its start by 100 steps can move performance from chance level to the reasoning regime. The location of the window varies with initialization scale, yet smaller initializations shrink the basin of attraction for reasoning solutions. The effect is task-specific: it does not appear on modular-arithmetic grokking, where constant weight decay suffices.

Core claim

The memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task, weight decay applied for a single 25%-of-training window matches full-training weight decay in out-of-distribution accuracy (0.93 vs 0.91). Holding the total regularization budget constant, placing it in the middle of training yields 5-9× higher OOD accuracy than placing it early. The boundary of the critical window is remarkably sharp: shifting the window onset by as little as 100 optimization steps moves mean OOD accuracy from chance (0.15) to the reasoning regime (0.61). The window's position depends systematically on initialization scale, but the basin of attraction for reasoning solutions shrinks at small initialization, contradicting the prevailing recommendation that smaller initialization is uniformly better.

What carries the argument

The critical window of training during which the timing of weight decay steers the model toward reasoning solutions rather than memorization.
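The timing mechanism can be made concrete as a schedule function that switches weight decay on only inside one window. This is a minimal sketch, not the paper's code; the total step count and window fractions are illustrative assumptions, while λ = 4×10⁻³ matches the value quoted in the Figure 2 caption.

```python
def weight_decay_schedule(step, total_steps, lam=4e-3,
                          window_start_frac=0.25, window_len_frac=0.25):
    """Weight decay coefficient at `step`: lam inside a single window
    covering `window_len_frac` of training, zero everywhere else."""
    start = int(window_start_frac * total_steps)
    end = start + int(window_len_frac * total_steps)
    return lam if start <= step < end else 0.0

# With an AdamW-style optimizer this would be applied per step, e.g.:
#   for group in optimizer.param_groups:
#       group["weight_decay"] = weight_decay_schedule(step, total_steps)
```

With the defaults and 20000 total steps, decay is active only on steps 5000-9999; everywhere else the schedule returns 0.0.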

If this is right

  • Weight decay applied during only 25% of training achieves out-of-distribution accuracy comparable to full-time application (0.93 versus 0.91).
  • Middle-of-training placement of the fixed regularization budget produces 5 to 9 times higher out-of-distribution accuracy than early placement.
  • A shift of 100 optimization steps in the start of the weight decay window can change mean out-of-distribution accuracy from 0.15 to 0.61.
  • The window location depends on initialization scale, and smaller initializations shrink the basin of attraction for reasoning solutions.
  • The critical-window effect is task-specific and absent on modular-arithmetic grokking where constant weight decay suffices.
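The budget-matched comparison behind the second bullet can be sketched as follows: every placement spends the same integrated budget ∫λ dt, and only the window's position changes. The placement names, 20000-step horizon, and 5000-step window are illustrative assumptions; the budget of 20 matches the ∫λ dt = 20 quoted in the Figure 3 caption.

```python
def placement_schedule(step, total_steps, budget=20.0, window_steps=5000,
                       placement="middle"):
    """Time-localized weight decay with a fixed integrated budget.

    The budget is spent uniformly inside one window of `window_steps`
    steps, so lambda = budget / window_steps while the window is active.
    """
    starts = {
        "early": 0,
        "middle": (total_steps - window_steps) // 2,
        "late": total_steps - window_steps,
    }
    start = starts[placement]
    lam = budget / window_steps
    return lam if start <= step < start + window_steps else 0.0

# Early and middle placements spend identical cumulative budget;
# only *when* the budget is spent differs.
spent = {p: sum(placement_schedule(s, 20000, placement=p)
                for s in range(20000))
         for p in ("early", "middle")}
```

The paper's contrast is that identical `spent` values nonetheless produce 5-9× different OOD accuracy depending on placement.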

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training schedules could focus regularization effort in the middle phase to reduce total compute while preserving generalization gains.
  • Different tasks may require distinct timing strategies for the same complexity-control mechanism.
  • Similar sharp periods could exist in other model scales or architectures and merit direct checks during standard pretraining runs.

Load-bearing premise

The controlled compositional task and chosen hyperparameter regimes represent how Transformers behave more generally on compositional problems.

What would settle it

On the same compositional task, finding that weight decay applied outside the identified 25% window still produces high out-of-distribution accuracy would show the window is not decisive.

Figures

Figures reproduced from arXiv: 2605.04396 by Sarwan Ali.

Figure 1. E1: phase diagram of OOD accuracy across (γ, λ). Each cell averages 3 seeds. The reasoning regime forms a horizontal stripe at λ ∈ {3×10⁻⁴, 10⁻³}. Outside this stripe, both above and below in λ, models memorize. The widely cited recommendation that smaller γ is uniformly preferable [Zhang et al., 2025] is not supported: the reasoning regime is robust at γ ∈ [0.8, 1.1] and degrades as γ decreases.

Figure 2. E2a: critical window scan. OOD accuracy for 5000-step windowed weight decay placed at varying onsets, γ = 0.8, λ = 4×10⁻³. Bars show mean ± standard deviation over 3 seeds. The window [0, 5000) produces the same OOD accuracy as no weight decay (red dashed); windows starting at 2500 through 12500 steps reach the full-WD plateau (green dashed) at 25% of the cumulative regularization cost. The cliff is sharp.

Figure 3. E2b: same regularization budget, different placement. All six conditions apply ∫λ dt = 20, varying only when in training the budget is spent. Middle placements achieve 5-9× higher OOD accuracy than early placements at identical cumulative cost. Mean ± std over 3 seeds.

Figure 4. E6/E7: the early boundary is a near-step-function. OOD accuracy as window onset is swept at coarse (left, 500 steps) and fine (right, 100 steps) resolution, γ = 0.8. The 0 → 500 transition is essentially complete: from chance to reasoning regime. Mean ± std over 3 seeds (E6) and 4 seeds (E7).

Figure 5. E5: critical window across initialization scales. OOD accuracy vs window onset for γ ∈ {0.5, 0.8, 1.1}. The shape is preserved but the reasoning-plateau height degrades sharply at small γ. Error bars show ± std over 3 seeds; the wide bars at γ = 0.5 reveal high seed-level variance, motivating the basin-of-attraction analysis in E8.

Figure 6. E8: basin of attraction shrinks at small γ. Per-seed OOD accuracy for 12 seeds at each γ. Red horizontal bars indicate means; the dotted horizontal line indicates the OOD = 0.5 threshold. At γ = 1.1, 12/12 seeds reach reasoning. At γ = 0.5, only 8/12 do, and the four failures collapse to chance (0.18-0.27).

Figure 7. E10: depth ablation on the anchor task. The critical-window phenomenon persists at 4 layers but with reduced reasoning-plateau height and increased seed variance. Left: OOD accuracy over training for all 7 schedule conditions, 3 seeds per condition. Right: final OOD accuracy by schedule placement, mean ± std over 3 seeds. The qualitative pattern matches the 2-layer result.

Figure 8. E11: critical-window phenomenon is robust to optimizer choice. Final OOD accuracy by schedule under AdamW (left) and SGD with momentum (right), 3 seeds per condition, mean ± std. In both cases, the middle window reaches the reasoning regime (AdamW 0.94 ± 0.08, SGD 0.99 ± 0.00) while the early window remains near chance (AdamW 0.12 ± 0.05, SGD 0.32 ± 0.16).

Figure 9. E3: online diagnostics across 40 runs at γ = 0.8. Color encodes weight decay value λ ∈ {0, 3×10⁻⁴, 10⁻³, 3×10⁻³, 10⁻²} from dark to light. Left: condensation index at 20% of training vs final OOD accuracy. The relationship is non-monotonic: high OOD occupies the band C(t/T = 0.2) ∈ [28, 36], while both extremes correspond to memorization. Center: bridge alignment at 20% of training vs final OOD accuracy.

Figure 10. E4: scheduled vs constant weight decay on grokking. OOD accuracy vs training step on modular arithmetic (p = 67, 40% train fraction), at the best constant-λ hyperparameter (λ* = 0.01). Constant weight decay groks at step ≈ 700; the time-localized schedule groks at step ≈ 3500-3600. The critical-window phenomenon does not transfer. Both schedules eventually reach OOD ≈ 1.0.
Original abstract

Recent work has shown that Transformers' compositional generalization is governed by \emph{complexity control}, initialization scale and weight decay, which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open \emph{when} during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i)~weight decay applied for a single 25\%-of-training window matches full-training weight decay in out-of-distribution (OOD) accuracy ($0.93$ vs $0.91$); (ii)~holding total regularization budget constant, placing it in the middle of training yields $5{-}9\times$ higher OOD accuracy than placing it early; (iii)~the boundary of the critical window is remarkably sharp, window onset shifted by as little as $100$ optimization steps causes mean OOD to jump from chance ($0.15$) to reasoning-regime ($0.61$); (iv)~the window's position depends systematically on initialization scale, but the basin of attraction for reasoning solutions \emph{shrinks} at small initialization, contradicting the prevailing recommendation that smaller initialization is uniformly better. We further show that the critical-window phenomenon is task-specific: it does not appear on grokking with modular arithmetic, where properly tuned constant weight decay matches scheduled weight decay.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the memorization-versus-reasoning outcome in Transformers is decided within a sharp, identifiable critical window during training. On a controlled compositional task, applying weight decay for only a single 25%-of-training window matches full-training weight decay in OOD accuracy (0.93 vs 0.91); mid-training placement of a fixed regularization budget yields 5-9× higher OOD accuracy than early placement; the window boundary is sharp (100-step shifts move mean OOD from 0.15 to 0.61); window position depends on initialization scale but the reasoning basin shrinks at small initialization, contradicting uniform preference for small init; the phenomenon is absent on modular-arithmetic grokking where constant weight decay suffices.

Significance. If the empirical contrasts hold under full experimental scrutiny, the work would be significant for shifting focus from static hyperparameter selection to dynamic scheduling of complexity control. It supplies concrete, falsifiable predictions about window timing, sharpness, and initialization dependence on a compositional task, together with a negative result on grokking that bounds the scope. These elements could inform more efficient regularization schedules and challenge prevailing initialization heuristics.

major comments (3)
  1. [Abstract / Experiments] Abstract and experimental sections: the central quantitative claims (0.93 vs 0.91 OOD accuracy, 5-9× gains, 100-step boundary, basin shrinkage) are reported without error bars, number of independent runs, statistical tests, or a complete hyperparameter table. This information is load-bearing for the sharpness and magnitude assertions; its absence prevents confirmation that the observed windows are robust rather than artifacts of a single seed or narrow regime.
  2. [§2 / Task Setup] Task definition and controls: the controlled compositional task is described only at high level in the abstract. Without an explicit statement of the input distribution, composition depth, and how OOD examples are constructed (including any leakage controls), it is impossible to evaluate whether the reported critical-window effects generalize beyond the specific task or are tied to its particular statistics.
  3. [§4.3 / Initialization Ablations] Initialization-scale dependence: the claim that the reasoning basin shrinks at small initialization (contradicting the prevailing recommendation) is central yet rests on a single contrast. A fuller ablation across multiple scales with basin-volume estimates or multiple random seeds would be required to establish that this is not an interaction with the particular optimizer or task.
minor comments (2)
  1. [§3] Notation for the 25%-window and total regularization budget should be defined explicitly (e.g., as a fraction of total steps and as an integrated L2 penalty) to avoid ambiguity when readers attempt to reproduce the schedule.
  2. [§5] The statement that the phenomenon is 'task-specific' would be strengthened by a brief quantitative comparison table showing the grokking result alongside the compositional-task result rather than a qualitative assertion.
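The notation requested in the first minor comment could be pinned down along these lines, writing the schedule and its integrated budget explicitly (symbols here are an editorial suggestion, not the paper's):

```latex
\lambda(t) =
  \begin{cases}
    \lambda_0 & t \in [t_{\mathrm{on}},\; t_{\mathrm{on}} + 0.25\,T) \\
    0         & \text{otherwise,}
  \end{cases}
\qquad
B = \int_0^T \lambda(t)\,\mathrm{d}t = 0.25\,T\,\lambda_0,
```

where $T$ is the total number of optimization steps, $t_{\mathrm{on}}$ the window onset, and $B$ the fixed budget (e.g., $B = 20$ in the Figure 3 conditions) held constant across placements.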

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of robustness and clarity. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where they strengthen the empirical claims without altering the core results.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experimental sections: the central quantitative claims (0.93 vs 0.91 OOD accuracy, 5-9× gains, 100-step boundary, basin shrinkage) are reported without error bars, number of independent runs, statistical tests, or a complete hyperparameter table. This information is load-bearing for the sharpness and magnitude assertions; its absence prevents confirmation that the observed windows are robust rather than artifacts of a single seed or narrow regime.

    Authors: We agree that statistical validation is essential to support the sharpness and magnitude of the reported effects. In the revised manuscript we will re-run the key experiments over 5 independent random seeds, report means with standard-deviation error bars, and include a complete hyperparameter table in the appendix. Where differences are central (e.g., the 5–9× OOD gains and the 100-step boundary), we will add simple statistical comparisons to confirm they are not seed-specific artifacts. revision: yes

  2. Referee: [§2 / Task Setup] Task definition and controls: the controlled compositional task is described only at high level in the abstract. Without an explicit statement of the input distribution, composition depth, and how OOD examples are constructed (including any leakage controls), it is impossible to evaluate whether the reported critical-window effects generalize beyond the specific task or are tied to its particular statistics.

    Authors: Section 2 of the manuscript already specifies the input distribution (synthetic sequences drawn from a depth-3 compositional grammar), the exact composition rules, and the OOD construction (novel combinations with explicit leakage controls that ensure no shared sub-structures beyond atomic tokens). Nevertheless, we acknowledge that a more self-contained presentation would aid readers. We will expand §2 with formal pseudocode for data generation, concrete numerical examples of in-distribution versus OOD instances, and an explicit statement of the leakage-prevention protocol. revision: partial

  3. Referee: [§4.3 / Initialization Ablations] Initialization-scale dependence: the claim that the reasoning basin shrinks at small initialization (contradicting the prevailing recommendation) is central yet rests on a single contrast. A fuller ablation across multiple scales with basin-volume estimates or multiple random seeds would be required to establish that this is not an interaction with the particular optimizer or task.

    Authors: We agree that a single contrast is insufficient to establish the initialization dependence robustly. In the revision we will extend the ablation to a wider range of initialization scales (0.01–1.0), report results over multiple random seeds, and provide approximate basin-volume estimates obtained by sampling multiple optimization trajectories per scale. These additions will confirm that the observed shrinkage of the reasoning basin at small initialization is not an artifact of the specific optimizer or task instance. revision: yes
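The leakage-controlled split described in response 2 could be sketched as follows. This is a hypothetical reconstruction, not the paper's generator: the primitives, depth, and held-out fraction are placeholder assumptions; only the control itself (hold out whole compositions, never atoms) mirrors what the rebuttal describes.

```python
import itertools
import random

def make_splits(primitives, depth=3, ood_frac=0.2, seed=0):
    """Hypothetical compositional train/OOD split with a leakage control.

    Enumerate all depth-`depth` compositions of atomic operations, then
    hold out a fraction of whole compositions for OOD evaluation, so
    every atom appears in training but held-out compositions never do.
    """
    combos = list(itertools.product(primitives, repeat=depth))
    rng = random.Random(seed)
    rng.shuffle(combos)
    n_ood = int(ood_frac * len(combos))
    ood, train = combos[:n_ood], combos[n_ood:]
    # leakage check: no full composition is shared across splits
    assert not set(ood) & set(train)
    return train, ood

train, ood = make_splits(["f", "g", "h", "k"], depth=3)
# every atom still appears somewhere in the training compositions
assert all(any(a in c for c in train) for a in ["f", "g", "h", "k"])
```

With 4 primitives at depth 3 this enumerates 64 compositions and holds out 12 of them; OOD accuracy is then measured only on combinations the model has never seen assembled.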

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents purely empirical results from controlled experiments on transformer training dynamics, with no mathematical derivation chain, no fitted functional forms, and no load-bearing self-citations. All quantitative claims (e.g., single-window weight decay matching full training, 5-9× OOD gains from mid-training placement, 100-step boundary sharpness) are direct experimental contrasts on a specific compositional task, explicitly scoped as task-specific and absent on modular arithmetic grokking. No step reduces to its own inputs by construction or via self-referential definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study is experimental; no mathematical derivation exists. The central observations rest on the assumption that the synthetic compositional task measures reasoning versus memorization and that controlled regularization budgets isolate timing effects.

free parameters (1)
  • critical window position and duration
    25% training window and middle placement chosen and tested to match full-training performance; specific values are experimental choices.
axioms (1)
  • domain assumption: The synthetic compositional task serves as a valid proxy for measuring compositional generalization and reasoning in Transformers.
    Used to interpret OOD accuracy differences as evidence of reasoning versus memorization.

pith-pipeline@v0.9.0 · 5571 in / 1396 out tokens · 63190 ms · 2026-05-08T17:56:24.623575+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 5 canonical work pages · 3 internal anchors

  1. Complexity control facilitates reasoning-based compositional generalization in transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  2. An Analysis for Reasoning Bias of Language Models with Small Initialization. Proceedings of the 42nd International Conference on Machine Learning (ICML).
  3. From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics. Advances in Neural Information Processing Systems.
  4. Out-of-distribution generalization via composition: a lens through induction heads in transformers. Proceedings of the National Academy of Sciences, 2025.
  5. An explainable transformer circuit for compositional generalization. arXiv preprint arXiv:2502.15801.
  6. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
  7. Towards understanding the condensation of neural networks at initial training. Advances in Neural Information Processing Systems.
  8. Understanding the initial condensation of convolutional neural networks. arXiv preprint arXiv:2305.09947.
  9. An overview of condensation phenomenon in deep learning. arXiv preprint arXiv:2504.09484.
  10. Critical learning periods in deep networks. International Conference on Learning Representations.
  11. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. Advances in Neural Information Processing Systems.
  12. Critical Learning Periods Emerge Even in Deep Linear Networks. International Conference on Learning Representations (ICLR).
  13. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.
  14. Progress Measures for Grokking via Mechanistic Interpretability. International Conference on Learning Representations (ICLR).
  15. Omnigrok: Grokking Beyond Algorithmic Data. International Conference on Learning Representations (ICLR).
  16. Grokking Beyond the Euclidean Norm of Model Parameters. Proceedings of the 42nd International Conference on Machine Learning (ICML).
  17. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems.
  18. Kernel and rich regimes in overparametrized models. Conference on Learning Theory, 2020.
  19. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems.
  20. Exploring generalization in deep learning. Advances in Neural Information Processing Systems.
  21. Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems.
  22. Attention is all you need. Advances in Neural Information Processing Systems.
  23. Decoupled Weight Decay Regularization. International Conference on Learning Representations (ICLR).
  24. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems.
  25. Topics in random matrix theory. 2023.
  26. High-dimensional probability: An introduction with applications in data science. 2018.
  27. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. International Conference on Machine Learning, 2018.