Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics
Pith reviewed 2026-05-16 10:21 UTC · model grok-4.3
The pith
Curricula for LLM pretraining mainly change how long models spend in shared latent phases rather than creating new phases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training follows a shared sequence of latent phases, while curricula mainly change the time spent in each phase. Random ordering yields higher gradient noise scale (GNS) in 14M-70M parameter models and late singular-entropy spikes in models up to 160M parameters, consistent with noisier gradients and output-head saturation. A reverse-order Verb Variation (VV) control shows that direction matters: descending order loses much of the accuracy advantage of the ascending curriculum. At larger scales, these stability differences are smaller. These results indicate that the curricula studied here are associated with more stable within-phase training in smaller models rather than with the creation of new phases.
What carries the argument
Latent training phases tracked by gradient noise scale (GNS) and singular-value structure of the output head, which measure how data order affects phase duration and within-phase stability.
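To make the output-head measurement concrete, here is a minimal sketch of one common way to turn a singular-value spectrum into an entropy. The paper's exact definition of singular entropy is not given on this page, so the normalization (singular values divided by their sum, entropy scaled by the log of the spectrum length) and the name `singular_entropy` are assumptions. The singular values are assumed to come from an SVD of the output-head weight matrix computed elsewhere.

```python
import math


def singular_entropy(singular_values, normalize=True):
    """Shannon entropy of a singular-value spectrum (illustrative definition).

    Assumption: the spectrum is turned into a probability vector by dividing
    each singular value by the sum of all singular values.
    """
    values = [float(v) for v in singular_values]
    if any(v < 0 for v in values):
        raise ValueError("singular values must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")

    # Zero singular values contribute nothing to the entropy (0 * log 0 := 0).
    probs = [v / total for v in values if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)

    # Optional scaling to [0, 1] by the maximum entropy for this spectrum length.
    if normalize and len(values) > 1:
        entropy /= math.log(len(values))
    return entropy
```

Under this definition a flat spectrum scores close to 1 and a spectrum dominated by a single direction scores close to 0, so tracking the value per checkpoint gives a single curve per run.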
If this is right
- Random ordering produces higher gradient noise scale in 14M-70M parameter models.
- Random ordering triggers singular-entropy spikes in the output head for models up to 160M parameters.
- Ascending verb-variation ordering improves accuracy while the descending reverse version removes most of that gain.
- Differences in stability between curricula become smaller once models reach hundreds of millions of parameters.
- Curricula improve within-phase stability rather than altering the overall sequence of phases.
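As a hedged illustration of the quantity these bullets compare, the sketch below uses the two-batch-size estimator of the simple noise scale from McCandlish et al. (2018). This page does not say which estimator the paper uses, so that choice and the name `gns_estimate` are assumptions. The inputs are squared gradient norms measured at a small and a large batch size on the same parameters.

```python
def gns_estimate(sq_norm_small, sq_norm_big, b_small, b_big):
    """Estimate the simple gradient noise scale tr(Sigma) / |G|^2.

    Relies on E[|G_B|^2] = |G|^2 + tr(Sigma) / B for a batch of size B.
    Returns (noise_scale, true_grad_sq_norm, trace_sigma).
    """
    if not 0 < b_small < b_big:
        raise ValueError("require 0 < b_small < b_big")

    # Unbiased estimate of the squared norm of the true gradient, |G|^2.
    grad_sq = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)

    # Unbiased estimate of the trace of the per-example gradient covariance.
    trace_sigma = (sq_norm_small - sq_norm_big) / (1.0 / b_small - 1.0 / b_big)

    # Single-step estimates are noisy; with real measurements grad_sq can even
    # come out non-positive, in which case the ratio is not meaningful.
    if grad_sq <= 0:
        return float("nan"), grad_sq, trace_sigma
    return trace_sigma / grad_sq, grad_sq, trace_sigma
```

In practice the two intermediate estimates are usually averaged over many steps before taking the ratio, since a ratio of noisy single-step estimates is biased.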
Where Pith is reading between the lines
- Curriculum design could shift focus from inventing new phases to tuning how long a model lingers in each existing phase.
- The shared phase sequence across orders suggests core learning dynamics in language models are largely order-independent.
- Repeating the phase analysis on models beyond 1B parameters would test whether stability gains continue to shrink.
- Linking each phase to measurable capabilities such as syntax or fact retention could clarify what changes when phase durations shift.
Load-bearing premise
The phases detected through gradient noise scale and singular-value analysis of the output head reflect real, consistent stages of the learning process across model sizes.
What would settle it
A curriculum that produces a distinct new latent phase absent from random ordering, or a phase sequence that fails to align when the same analysis is repeated on models of different sizes.
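This test can be phrased as a check on per-checkpoint phase labels. The following is a minimal sketch, assuming each run has already been assigned one phase label per checkpoint; the labels, run names, and helper names are hypothetical and not taken from the paper.

```python
from itertools import groupby


def phase_sequence(labels):
    """Collapse consecutive repeats, keeping only the order of phases visited."""
    return [phase for phase, _ in groupby(labels)]


def phase_durations(labels):
    """Return (phase, number of checkpoints) for each consecutive run of a phase."""
    return [(phase, sum(1 for _ in group)) for phase, group in groupby(labels)]


def shares_sequence(runs):
    """True if every run visits the same phases in the same order."""
    sequences = [phase_sequence(labels) for labels in runs.values()]
    return all(seq == sequences[0] for seq in sequences)


# Hypothetical labels: same order of phases, different time spent in each.
runs = {
    "random": ["A", "A", "A", "B", "B", "C"],
    "curriculum": ["A", "B", "B", "B", "C", "C"],
}
same_order = shares_sequence(runs)

# A run that visits a phase absent elsewhere would break the shared sequence.
runs_with_new_phase = dict(runs, other=["A", "B", "D", "C", "C", "C"])
still_same_order = shares_sequence(runs_with_new_phase)
```

Comparing `phase_sequence` across runs isolates the ordering claim, while `phase_durations` carries the part that curricula are reported to change.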
Original abstract
Curriculum learning changes the order of pretraining data, but it remains unclear how ordering changes the learning dynamics. We pretrain models from 14M to 1B parameters for 300B tokens under three linguistically motivated curricula--Age-of-Acquisition, word frequency, and Verb Variation (VV)--and compare each against Random ordering. We analyze latent training phases, gradient noise scale (GNS), and the singular-value structure of the output head. We find that training follows a shared sequence of latent phases, while curricula mainly change time spent in each phase. Random ordering yields higher GNS at 14M-70M and late singular-entropy spikes up to 160M, consistent with noisier gradients and output-head saturation. A reverse-order VV control shows that direction matters: descending order loses much of the accuracy advantage of the ascending curriculum. At larger scales, these stability differences are smaller. These results indicate that the curricula studied here are associated with more stable within-phase training in smaller models rather than with the creation of new phases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines how curriculum learning affects LLM pretraining dynamics by pretraining models from 14M to 1B parameters for 300B tokens under three linguistically motivated curricula (Age-of-Acquisition, word frequency, Verb Variation) versus random ordering. Using analyses of latent training phases via gradient noise scale (GNS) trajectories and output-head singular-value structure, it concludes that all settings follow a shared sequence of phases while curricula primarily modulate phase durations and within-phase stability (with benefits clearest at smaller scales); a reverse-order VV control shows curriculum direction matters.
Significance. If the phase segmentation is shown to be robust and reproducible, the work would provide a useful mechanistic account of why curricula improve pretraining: they stabilize existing phases rather than alter the underlying sequence. The multi-scale experiments and reverse-order control are strengths that could inform more principled curriculum design for efficient LLM training.
major comments (1)
- [latent phase analysis (GNS and singular-value sections)] The central claim that curricula preserve an identical ordered sequence of latent phases (only changing durations and stability) rests on visual segmentation of GNS and singular-value entropy plots. No explicit, automated boundary-detection procedure (change-point algorithm, derivative threshold, clustering rule, or inter-run consistency metric) is provided, raising the risk that boundaries are placed post-hoc to maintain sequence alignment. This is load-bearing for the distinction between 'shared sequence' and 'changed durations' and must be formalized with a reproducible rule before the conclusion can be accepted.
minor comments (2)
- [Abstract] The abstract omits any description of the phase-identification method, statistical tests, or error bars; adding these details would improve clarity without altering the core argument.
- [Figures] Figures displaying GNS trajectories and singular-value entropy should include error bars or shaded regions from multiple runs to convey variability, especially when claiming stability differences.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the value of our multi-scale experiments and reverse-order control. We address the major comment on formalizing the latent phase analysis below and will revise the manuscript accordingly to improve reproducibility.
Point-by-point responses
- Referee: The central claim that curricula preserve an identical ordered sequence of latent phases (only changing durations and stability) rests on visual segmentation of GNS and singular-value entropy plots. No explicit, automated boundary-detection procedure (change-point algorithm, derivative threshold, clustering rule, or inter-run consistency metric) is provided, raising the risk that boundaries are placed post-hoc to maintain sequence alignment. This is load-bearing for the distinction between 'shared sequence' and 'changed durations' and must be formalized with a reproducible rule before the conclusion can be accepted.
  Authors: We agree that an explicit, automated boundary-detection procedure would strengthen the reproducibility of the phase segmentation. In the manuscript, phases were delineated by identifying consistent inflection points and transitions in the GNS trajectories (U-shaped patterns) and abrupt shifts in output-head singular-value entropy, which aligned across all model scales (14M to 1B) and curricula in our experiments. While this visual approach was grounded in the clear structural changes observed uniformly, we acknowledge the referee's concern about potential post-hoc placement. In the revision, we will add a formal method section describing a reproducible rule: we will apply the PELT change-point detection algorithm to smoothed GNS time series with a penalty parameter calibrated on the random-ordering baseline runs to detect the major transitions. We will also report an inter-run consistency metric (variance in boundary locations across seeds) and confirm that the detected sequence remains identical while durations vary. This will be applied uniformly to all conditions and will support the central claim without changing the findings.
  Revision: yes
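The proposed rule can be sketched as follows. This is a pure-Python illustration of PELT (Killick et al., 2012) with a squared-error mean-shift cost, not the authors' implementation. The cost function, the penalty value, and the names `pelt_mean_shift` and `boundary_variance` are assumptions, and the input series is assumed to be smoothed already.

```python
from statistics import pvariance


def pelt_mean_shift(series, penalty):
    """Return change-point indices minimizing segment cost + penalty per segment."""
    n = len(series)
    # Prefix sums let each segment's squared-error cost be computed in O(1).
    s1, s2 = [0.0], [0.0]
    for y in series:
        s1.append(s1[-1] + y)
        s2.append(s2[-1] + y * y)

    def cost(a, b):
        # Sum of squared deviations from the mean over series[a:b].
        seg = s1[b] - s1[a]
        return (s2[b] - s2[a]) - seg * seg / (b - a)

    best = [0.0] * (n + 1)   # best[t]: optimal penalized cost of series[:t]
    best[0] = -penalty
    last = [0] * (n + 1)     # last[t]: start of the final segment ending at t
    candidates = [0]
    for t in range(1, n + 1):
        scored = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(scored)
        # PELT pruning: drop starts that can never be optimal for later t.
        candidates = [s for value, s in scored if value - penalty <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts to recover the boundaries.
    boundaries, t = [], n
    while last[t] > 0:
        t = last[t]
        boundaries.append(t)
    return sorted(boundaries)


def boundary_variance(boundaries_per_seed):
    """Population variance of each boundary location across seeds."""
    if len({len(b) for b in boundaries_per_seed}) != 1:
        raise ValueError("seeds disagree on the number of boundaries")
    return [pvariance(column) for column in zip(*boundaries_per_seed)]


# Hypothetical noiseless series with three plateaus of 20 checkpoints each.
toy_series = [0.0] * 20 + [5.0] * 20 + [1.0] * 20
toy_boundaries = pelt_mean_shift(toy_series, penalty=1.0)
```

Raising an error when seeds disagree on the number of boundaries is a deliberate choice in this sketch: a differing count would itself be evidence against a shared phase sequence, so it should not be averaged away.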
Circularity Check
Empirical comparisons of curricula show no reduction to self-referential definitions or fitted predictions
Full rationale
The paper reports direct pretraining experiments from 14M to 1B parameters under linguistically motivated curricula versus random ordering, measuring GNS trajectories and output-head singular-value entropy. Claims about shared latent phase sequences and curricula modulating durations rest on observed consistency across runs and scales rather than any equation that defines phases in terms of the curriculum effect itself. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing; the analysis is self-contained against the reported metrics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery and orbit embedding (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Across all orderings, training follows a shared sequence of latent phases. Curricula mainly change which data appears within each phase... joint HMM analysis across orderings... shared HMM state transition diagram
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel and J-cost convexity (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: curriculum learning as a stability mechanism... bounding gradient variance... Theorem 3.2
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
  Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.