Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics

Hadi Amiri; Mohamed Elgaar

arxiv: 2601.21698 · v2 · submitted 2026-01-29 · 💻 cs.LG · cs.AI

Mohamed Elgaar, Hadi Amiri

Pith reviewed 2026-05-16 10:21 UTC · model grok-4.3

keywords curriculum learning · LLM pretraining · learning dynamics · gradient noise scale · singular value structure · latent phases · data ordering · model scale

The pith

Curricula for LLM pretraining mainly change how long models spend in shared latent phases rather than creating new phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper pretrains models from 14M to 1B parameters on 300B tokens under age-of-acquisition, word frequency, and verb-variation curricula and compares them to random ordering. It tracks latent phases using gradient noise scale and the singular-value structure of the output head. All orderings follow the same phase sequence, but curricula mainly shift the time spent inside each phase. Random ordering shows higher noise and output-head saturation at smaller scales, while a reverse verb-variation order loses the gains of the forward version. These stability effects shrink at larger scales.

Core claim

Training follows a shared sequence of latent phases, while curricula mainly change time spent in each phase. Random ordering yields higher GNS at 14M-70M and late singular-entropy spikes up to 160M, consistent with noisier gradients and output-head saturation. A reverse-order VV control shows that direction matters: descending order loses much of the accuracy advantage of the ascending curriculum. At larger scales, these stability differences are smaller. These results indicate that the curricula studied here are associated with more stable within-phase training in smaller models rather than with the creation of new phases.

What carries the argument

Latent training phases tracked by gradient noise scale (GNS) and singular-value structure of the output head, which measure how data order affects phase duration and within-phase stability.
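To make the first of these quantities concrete, the sketch below estimates the "simple" gradient noise scale, B = tr(Σ) / |G|², from squared gradient norms measured at two batch sizes, following the two-batch estimator commonly attributed to McCandlish et al. (2018). The summary does not say which estimator the paper uses, so the function name, its arguments, and the toy numbers are assumptions for illustration only.

```python
def simple_gns(sq_norm_small, sq_norm_big, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from two squared gradient norms.

    Relies on E[|g_B|^2] = |G|^2 + tr(Sigma) / B for a batch of size B.
    All argument names are hypothetical, not taken from the paper.
    """
    if b_big <= b_small:
        raise ValueError("b_big must be larger than b_small")
    # Unbiased estimate of the squared norm of the true gradient.
    true_grad_sq = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)
    # Unbiased estimate of the trace of the per-example gradient covariance.
    trace_sigma = (sq_norm_small - sq_norm_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / true_grad_sq


# Toy check with |G|^2 = 4 and tr(Sigma) = 64:
# batch 8 gives 4 + 64/8 = 12, batch 64 gives 4 + 64/64 = 5.
gns = simple_gns(sq_norm_small=12.0, sq_norm_big=5.0, b_small=8, b_big=64)
```

In practice both intermediate estimates are noisy, so they are usually averaged over many steps before the ratio is taken; that smoothing is omitted here.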

If this is right

Random ordering produces higher gradient noise scale in 14M-70M parameter models.
Random ordering triggers singular-entropy spikes in the output head for models up to 160M parameters.
Ascending verb-variation ordering improves accuracy while the descending reverse version removes most of that gain.
Differences in stability between curricula become smaller once models reach hundreds of millions of parameters.
Curricula improve within-phase stability rather than altering the overall sequence of phases.
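The singular-entropy signal mentioned above can be sketched as the Shannon entropy of the normalized singular values of the output-head weight matrix. This definition is an assumption: the summary does not give the paper's exact formula. The singular values themselves would come from a standard routine such as `numpy.linalg.svd(W, compute_uv=False)`; the function below takes them as a plain list so that it stays dependency-free.

```python
import math


def singular_entropy(singular_values, normalize=True):
    """Shannon entropy of the singular-value distribution (assumed definition).

    With normalize=True the result is divided by log(rank count), so a flat
    spectrum gives 1.0 and a single dominant direction gives 0.0.
    """
    positive = [s for s in singular_values if s > 0.0]
    if not positive:
        return 0.0
    total = sum(positive)
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    if normalize and len(singular_values) > 1:
        entropy /= math.log(len(singular_values))
    return entropy


flat = singular_entropy([1.0, 1.0, 1.0, 1.0])       # evenly spread spectrum
collapsed = singular_entropy([5.0, 0.0, 0.0, 0.0])  # one dominant direction
```

Tracking this value per checkpoint yields a single curve per run, which is the kind of trajectory in which a late spike would be visible.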

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Curriculum design could shift focus from inventing new phases to tuning how long a model lingers in each existing phase.
The shared phase sequence across orders suggests core learning dynamics in language models are largely order-independent.
Repeating the phase analysis on models beyond 1B parameters would test whether stability gains continue to shrink.
Linking each phase to measurable capabilities such as syntax or fact retention could clarify what changes when phase durations shift.

Load-bearing premise

The phases detected through gradient noise scale and singular-value analysis of the output head reflect real, consistent stages of the learning process across model sizes.

What would settle it

A curriculum that produces a distinct new latent phase absent from random ordering, or a phase sequence that fails to align when the same analysis is repeated on models of different sizes.
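One way to operationalize this test, assuming every checkpoint has already been assigned a phase label by some segmentation method, is to collapse each run into (phase, duration) pairs and compare only the order of phases. The label values and both helper names below are hypothetical.

```python
from itertools import groupby


def phase_runs(labels):
    """Collapse per-checkpoint phase labels into (phase, duration) runs."""
    return [(phase, sum(1 for _ in run)) for phase, run in groupby(labels)]


def same_sequence(labels_a, labels_b):
    """True when two runs visit the same phases in the same order."""
    seq_a = [phase for phase, _ in phase_runs(labels_a)]
    seq_b = [phase for phase, _ in phase_runs(labels_b)]
    return seq_a == seq_b


random_order = [0, 0, 0, 1, 1, 2, 2, 2]  # toy labels, not data from the paper
curriculum = [0, 0, 1, 1, 1, 1, 2, 2]    # same phases, different durations
novel_phase = [0, 0, 3, 1, 1, 2, 2, 2]   # phase 3 would falsify the claim

shared = same_sequence(random_order, curriculum)
broken = same_sequence(random_order, novel_phase)
```

Here `shared` is true even though the durations differ, while `broken` is false because an extra phase appears in the sequence.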

Original abstract

Curriculum learning changes the order of pretraining data, but it remains unclear how ordering changes the learning dynamics. We pretrain models from 14M to 1B parameters for 300B tokens under three linguistically motivated curricula--Age-of-Acquisition, word frequency, and Verb Variation (VV)--and compare each against Random ordering. We analyze latent training phases, gradient noise scale (GNS), and the singular-value structure of the output head. We find that training follows a shared sequence of latent phases, while curricula mainly change time spent in each phase. Random ordering yields higher GNS at 14M-70M and late singular-entropy spikes up to 160M, consistent with noisier gradients and output-head saturation. A reverse-order VV control shows that direction matters: descending order loses much of the accuracy advantage of the ascending curriculum. At larger scales, these stability differences are smaller. These results indicate that the curricula studied here are associated with more stable within-phase training in smaller models rather than with the creation of new phases.
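The four conditions in the abstract differ only in how the pretraining data is ordered. A minimal sketch, assuming each document already carries a scalar difficulty score (for example an age-of-acquisition, frequency, or verb-variation statistic): the scoring itself, the document-level granularity, and every name below are assumptions rather than details from the paper.

```python
import random


def order_documents(docs, scores, mode, seed=0):
    """Return docs reordered by a per-document difficulty score.

    mode: "ascending" (curriculum), "descending" (reverse-order control),
    or "random" (shuffled baseline). Scores are hypothetical inputs.
    """
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    indices = list(range(len(docs)))
    if mode == "random":
        # Seeded shuffle so the baseline ordering is reproducible.
        random.Random(seed).shuffle(indices)
    elif mode in ("ascending", "descending"):
        indices.sort(key=lambda i: scores[i], reverse=(mode == "descending"))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [docs[i] for i in indices]


docs = ["a", "b", "c"]
scores = [0.9, 0.1, 0.5]
forward = order_documents(docs, scores, "ascending")
reverse = order_documents(docs, scores, "descending")
```

Because the three modes share one data pool, any difference between runs can be attributed to ordering rather than to content.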

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Curricula mainly adjust how long models linger in the usual pretraining phases and improve stability at small scales, without creating new ones.

The central finding is that linguistically motivated curricula change the time spent in a shared sequence of latent phases and are associated with lower gradient noise at smaller scales, rather than altering the phase structure itself. Random ordering shows higher GNS at 14M-70M and late singular-entropy spikes up to 160M; the ascending VV curriculum carries an accuracy advantage that the reverse-order control largely loses, which confirms that direction matters. At larger scales those differences shrink, which is consistent with ordering effects being more pronounced when capacity is limited.

Referee Report

1 major / 2 minor

Summary. The manuscript examines how curriculum learning affects LLM pretraining dynamics by pretraining models from 14M to 1B parameters for 300B tokens under three linguistically motivated curricula (Age-of-Acquisition, word frequency, Verb Variation) versus random ordering. Using analyses of latent training phases via gradient noise scale (GNS) trajectories and output-head singular-value structure, it concludes that all settings follow a shared sequence of phases while curricula primarily modulate phase durations and within-phase stability (with benefits clearest at smaller scales); a reverse-order VV control shows curriculum direction matters.

Significance. If the phase segmentation is shown to be robust and reproducible, the work would provide a useful mechanistic account of why curricula improve pretraining: they stabilize existing phases rather than alter the underlying sequence. The multi-scale experiments and reverse-order control are strengths that could inform more principled curriculum design for efficient LLM training.

major comments (1)

[latent phase analysis (GNS and singular-value sections)] The central claim that curricula preserve an identical ordered sequence of latent phases (only changing durations and stability) rests on visual segmentation of GNS and singular-value entropy plots. No explicit, automated boundary-detection procedure (change-point algorithm, derivative threshold, clustering rule, or inter-run consistency metric) is provided, raising the risk that boundaries are placed post-hoc to maintain sequence alignment. This is load-bearing for the distinction between 'shared sequence' and 'changed durations' and must be formalized with a reproducible rule before the conclusion can be accepted.
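Of the options the comment lists, an inter-run consistency metric is the simplest to illustrate. The sketch below assumes each seed yields a sorted list of boundary steps and matches boundaries by rank; the function name and the toy step values are hypothetical.

```python
from statistics import pvariance


def boundary_variance(boundaries_by_seed):
    """Population variance of each boundary's location across seeds.

    boundaries_by_seed holds one sorted list of boundary steps per seed.
    Boundaries are matched by rank, so every seed must report the same count.
    """
    counts = {len(b) for b in boundaries_by_seed}
    if len(counts) != 1:
        raise ValueError("runs disagree on the number of boundaries")
    return [pvariance(column) for column in zip(*boundaries_by_seed)]


# Three seeds, two boundaries each (toy numbers, not data from the paper).
spread = boundary_variance([[100.0, 400.0], [120.0, 400.0], [110.0, 400.0]])
```

A mismatch in boundary counts raises an error on purpose: it signals that the runs do not share a phase sequence, which is the case the referee is worried about.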

minor comments (2)

[Abstract] The abstract omits any description of the phase-identification method, statistical tests, or error bars; adding these details would improve clarity without altering the core argument.
[Figures] Figures displaying GNS trajectories and singular-value entropy should include error bars or shaded regions from multiple runs to convey variability, especially when claiming stability differences.
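The shaded regions requested here can be computed as a per-step mean and standard deviation across seeds. A minimal sketch, assuming all runs log the metric at the same steps; the function name and the toy values are illustrative.

```python
from statistics import mean, stdev


def band_across_runs(runs):
    """Per-step (mean, sample standard deviation) across equally long runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate variability")
    if len({len(r) for r in runs}) != 1:
        raise ValueError("runs must be logged at the same steps")
    # zip(*runs) groups the values that share a step index.
    return [(mean(values), stdev(values)) for values in zip(*runs)]


band = band_across_runs([[1.0, 2.0], [3.0, 2.0]])  # two seeds, two steps
```

Plotting mean ± one standard deviation from `band` gives the shaded region; a stability difference between orderings is only convincing where the bands separate.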

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of our multi-scale experiments and reverse-order control. We address the major comment on formalizing the latent phase analysis below and will revise the manuscript accordingly to improve reproducibility.

Referee: The central claim that curricula preserve an identical ordered sequence of latent phases (only changing durations and stability) rests on visual segmentation of GNS and singular-value entropy plots. No explicit, automated boundary-detection procedure (change-point algorithm, derivative threshold, clustering rule, or inter-run consistency metric) is provided, raising the risk that boundaries are placed post-hoc to maintain sequence alignment. This is load-bearing for the distinction between 'shared sequence' and 'changed durations' and must be formalized with a reproducible rule before the conclusion can be accepted.

Authors: We agree that an explicit, automated boundary-detection procedure would strengthen the reproducibility of the phase segmentation. In the manuscript, phases were delineated by identifying consistent inflection points and transitions in the GNS trajectories (U-shaped patterns) and abrupt shifts in output-head singular-value entropy, which aligned across all model scales (14M to 1B) and curricula in our experiments. While this visual approach was grounded in the clear structural changes observed uniformly, we acknowledge the referee's concern about potential post-hoc placement. In the revision, we will add a formal method section describing a reproducible rule: we will apply the PELT change-point detection algorithm to smoothed GNS time series with a penalty parameter calibrated on the random-ordering baseline runs to detect the major transitions. We will also report an inter-run consistency metric (variance in boundary locations across seeds) and confirm that the detected sequence remains identical while durations vary. This will be applied uniformly to all conditions and will support the central claim without changing the findings.

Revision: yes
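For readers unfamiliar with PELT (Killick et al., 2012), the following is a compact pure-Python sketch of the algorithm with a squared-error cost, which detects shifts in the mean of a series. The cost model, the penalty value, and the synthetic series are assumptions; the smoothing step and the penalty calibration described in the rebuttal are not shown, and a maintained library implementation would be preferable for real analyses.

```python
def pelt_mean_shift(signal, penalty):
    """Return change-point indices under a piecewise-constant-mean model.

    An index k in the result means a new segment starts at signal[k].
    """
    n = len(signal)
    # Prefix sums give any segment's squared-error cost in O(1).
    s1, s2 = [0.0], [0.0]
    for y in signal:
        s1.append(s1[-1] + y)
        s2.append(s2[-1] + y * y)

    def cost(a, b):
        # Sum of squared deviations from the mean over signal[a:b].
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] * (n + 1)  # best[t]: optimal penalized cost of signal[:t]
    best[0] = -penalty      # offsets the penalty charged to the first segment
    last = [0] * (n + 1)    # last[t]: start of the final segment in that optimum
    candidates = [0]
    for t in range(1, n + 1):
        scored = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(scored)
        # Pruning step: drop segment starts that can never be optimal again.
        candidates = [s for value, s in scored if value - penalty <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts to recover the boundaries.
    change_points = []
    t = n
    while last[t] > 0:
        t = last[t]
        change_points.append(t)
    return change_points[::-1]


series = [0.0] * 20 + [5.0] * 20 + [1.0] * 20  # synthetic three-level series
boundaries = pelt_mean_shift(series, penalty=3.0)
```

A larger penalty yields fewer boundaries, which is why calibrating it on one condition and then applying it unchanged to all others matters for the comparison.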

Circularity Check

0 steps flagged

Empirical comparisons of curricula show no reduction to self-referential definitions or fitted predictions

The paper reports direct pretraining experiments from 14M to 1B parameters under linguistically motivated curricula versus random ordering, measuring GNS trajectories and output-head singular-value entropy. Claims about shared latent phase sequences and curricula modulating durations rest on observed consistency across runs and scales rather than any equation that defines phases in terms of the curriculum effect itself. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing; the analysis is self-contained against the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; no explicit free parameters fitted to results, no new axioms, and no invented entities. Relies on standard assumptions that GNS and singular-value entropy are valid proxies for training dynamics.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and orbit embedding · unclear

Paper passage: Across all orderings, training follows a shared sequence of latent phases. Curricula mainly change which data appears within each phase... joint HMM analysis across orderings... shared HMM state transition diagram

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel and J-cost convexity · unclear

Paper passage: curriculum learning as a stability mechanism... bounding gradient variance... Theorem 3.2

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
cs.CL · 2026-05 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.