How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Pith reviewed 2026-05-17 05:23 UTC · model grok-4.3
The pith
Standard learning rate decay largely cancels the benefits of ordering training data by increasing quality in LLM pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Curriculum-based pretraining in ascending order of data quality substantially outperforms random shuffling when using a constant learning rate, but this advantage diminishes under standard decaying learning rate schedules. The incompatibility can be mitigated by employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, or by replacing LR decay with model averaging, i.e., a weighted average of the final few checkpoints. Combining the two strategies improves the average benchmark score by 1.64% over random shuffling, without further data refinement.
What carries the argument
The timing mismatch between high-quality data presentation in an ascending curriculum and the progressive reduction of learning rate, which limits the weight of updates on the best data.
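The timing argument can be made concrete with a rough sketch. The code below assumes a plain cosine schedule without warmup and uses the summed learning rate over a span of steps as a crude proxy for how much that span can move the weights; this proxy, the function names, and the values for `total_steps`, `peak_lr`, and `tail_fraction` are illustrative assumptions, not quantities taken from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_ratio):
    """Cosine decay from peak_lr down to final_ratio * peak_lr (no warmup)."""
    final_lr = final_ratio * peak_lr
    progress = step / max(total_steps - 1, 1)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

def tail_lr_share(total_steps, peak_lr, final_ratio, tail_fraction):
    """Fraction of the summed LR that falls in the last tail_fraction of steps."""
    lrs = [cosine_lr(s, total_steps, peak_lr, final_ratio) for s in range(total_steps)]
    tail_start = int(round(total_steps * (1.0 - tail_fraction)))
    return sum(lrs[tail_start:]) / sum(lrs)

# Hypothetical settings: in an ascending curriculum, the last 20% of steps
# would hold the highest-quality data.
for ratio in (0.0, 0.5, 1.0):  # decay to zero, moderate decay, constant LR
    share = tail_lr_share(total_steps=10_000, peak_lr=3e-4,
                          final_ratio=ratio, tail_fraction=0.2)
    print(f"final_ratio={ratio:.1f}: last 20% of steps get {share:.1%} of summed LR")
```

Under this proxy, a constant schedule gives the final 20% of steps exactly 20% of the summed learning rate, while a cosine decay to zero gives them far less. Real optimizers (adaptive methods, warmup, weight decay) complicate the picture, so the numbers should be read as intuition only.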
If this is right
- Moderate learning rate decay keeps the performance edge of ascending-quality curricula intact.
- Model averaging of late checkpoints works as a drop-in replacement for aggressive decay and recovers curriculum benefits.
- Data curricula and optimization schedules must be designed together to make full use of high-quality subsets.
- Mixed-quality datasets can deliver better results through schedule adjustments alone, without extra curation.
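The model-averaging alternative mentioned above amounts to a weighted average of late checkpoints. A minimal sketch follows, assuming each checkpoint is a dictionary mapping parameter names to flat lists of floats; the function name, the weighting scheme, and the toy parameter values are hypothetical, and the paper's actual averaging weights are not given in this summary.

```python
def average_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dicts that share the same keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)  # uniform average by default
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = float(sum(weights))
    averaged = {}
    for name in checkpoints[0]:
        length = len(checkpoints[0][name])
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(length)
        ]
    return averaged

# Toy stand-ins for the final few checkpoints (made-up values, oldest first).
late_checkpoints = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
    {"layer.weight": [5.0, 6.0], "layer.bias": [2.0]},
]
uniform = average_checkpoints(late_checkpoints)
recent_heavy = average_checkpoints(late_checkpoints, weights=[1.0, 2.0, 3.0])
print(uniform)
print(recent_heavy)
```

The training run itself would keep a constant (or only moderately decayed) learning rate, and the averaged parameters would be used for evaluation in place of the last checkpoint.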
Where Pith is reading between the lines
- Standard learning-rate schedules developed for random data order may systematically undervalue ordered curricula.
- Placing higher-quality data earlier or keeping learning rates higher longer could amplify the same effect.
- The result suggests similar schedule adjustments may help in other sequential training regimes where data quality varies.
Load-bearing premise
The performance lift from quality-based ordering is limited primarily by when the learning rate is still high rather than by data diversity or model capacity.
What would settle it
Train the same 1.5B model with a moderate decay schedule. If ascending-quality ordering then produces no measurable gain over random shuffling on the benchmark suite, the claim fails.
Original abstract
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
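The two data orderings compared in the abstract can be sketched as follows. The example assumes each document already has a scalar quality score; the scores, the function names, and the tie-breaking rule (original index) are illustrative assumptions, since this summary does not describe how the paper handles ties or batch construction.

```python
import random

def ascending_quality_order(quality_scores):
    """Indices sorted from lowest to highest quality score.

    Ties are broken by original index so the order is deterministic
    (an assumption made here for reproducibility).
    """
    return sorted(range(len(quality_scores)), key=lambda i: (quality_scores[i], i))

def random_order(num_docs, seed):
    """Random-shuffle baseline with a fixed seed."""
    indices = list(range(num_docs))
    random.Random(seed).shuffle(indices)
    return indices

# Hypothetical quality scores for five documents.
scores = [0.9, 0.2, 0.5, 0.2, 0.7]
curriculum = ascending_quality_order(scores)
baseline = random_order(len(scores), seed=0)
print("curriculum order:", curriculum)
print("random baseline: ", baseline)
```

In the curriculum order, the highest-scoring documents are consumed last, which is exactly where a decaying schedule has its smallest learning rates.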
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that ascending-quality curriculum learning for LLM pretraining is incompatible with standard decaying learning-rate schedules. While curriculum ordering substantially outperforms random shuffling under constant LR, the advantage largely vanishes under cosine decay because high-quality data arrives late when the LR is small. The authors propose two mitigations—moderate final LR and model averaging of late checkpoints—and report a 1.64% average benchmark improvement on 1.5B-parameter models trained for 30B tokens across multiple quality metrics.
Significance. If the central empirical finding holds, the work offers a practical explanation for why prior curriculum-based pretraining studies have shown only modest gains and supplies simple, zero-cost fixes that improve data utilization. The validation on 1.5B models with 30B tokens and several quality metrics supplies a concrete, reproducible basis for re-examining the co-design of data order and optimization schedules in large-scale training.
major comments (2)
- [Abstract and §4] The evidence for the claim that LR decay specifically wastes high-quality data rests on the performance gap between constant-LR and decaying-LR conditions (Abstract and §4). This comparison does not isolate whether the high-quality tail is under-learned at low LR; final benchmark deltas could arise from other interactions between ordering and optimization trajectory. Direct measurements such as per-quality-subset loss curves or gradient statistics at late training stages would strengthen the causal link.
- [Abstract and Results] The reported 1.64% average improvement and the claim that curriculum advantage “diminishes” under standard decay lack run-to-run variance, number of seeds, or statistical tests. Without these, it is difficult to judge whether the observed differences are robust or could be explained by optimization stochasticity alone.
minor comments (2)
- [§3] Clarify the exact implementation of the quality metrics and the precise definition of the ascending curriculum (e.g., how ties or batch construction are handled) to aid reproducibility.
- [Figures in §4] Figure captions and axis labels should explicitly state the LR schedule (constant vs. cosine) and the data-ordering condition for each curve.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of our empirical results.
- Referee: [Abstract and §4] The evidence for the claim that LR decay specifically wastes high-quality data rests on the performance gap between constant-LR and decaying-LR conditions (Abstract and §4). This comparison does not isolate whether the high-quality tail is under-learned at low LR; final benchmark deltas could arise from other interactions between ordering and optimization trajectory. Direct measurements such as per-quality-subset loss curves or gradient statistics at late training stages would strengthen the causal link.
  Authors: The experimental design holds all factors fixed except data ordering and LR schedule, allowing direct comparison of curriculum versus random ordering under constant LR (where curriculum yields large gains) versus decaying LR (where gains largely vanish). This differential outcome isolates the interaction between ascending quality order and LR decay. We agree that more granular measurements would further support the interpretation. In the revision we will add per-quality-subset loss curves and late-stage gradient statistics to provide direct evidence that high-quality data receives insufficient optimization under standard decay.
  Revision: yes
- Referee: [Abstract and Results] The reported 1.64% average improvement and the claim that curriculum advantage “diminishes” under standard decay lack run-to-run variance, number of seeds, or statistical tests. Without these, it is difficult to judge whether the observed differences are robust or could be explained by optimization stochasticity alone.
  Authors: We agree that variance reporting is essential for assessing robustness. All reported results were obtained from multiple independent training runs using different random seeds. In the revised manuscript we will explicitly state the number of seeds, report means with standard deviations (or error bars) for the benchmark averages including the 1.64% gain, and include appropriate statistical comparisons to confirm that the differences exceed run-to-run variability.
  Revision: yes
Circularity Check
Purely empirical study with no circular derivation chain
The paper reports direct experimental comparisons of curriculum ordering versus random shuffling on 1.5B models trained for 30B tokens, using multiple quality metrics and benchmark suites. All central claims rest on measured performance deltas under constant LR versus cosine decay, plus two mitigation strategies (moderate decay and model averaging). No equations, fitted parameters, uniqueness theorems, or ansatzes are invoked; results are externally falsifiable via replication on the stated data and schedules. The work contains no derivation that reduces to its inputs by construction.