How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Chenyi Dang; Haodong Wen; Jiarui Cui; Kaifeng Lyu; Kairong Luo; Wenguang Chen; Xinyu Shi; Zhenbo Sun

arXiv: 2511.18903 · v3 · submitted 2025-11-24 · 💻 cs.LG · cs.AI · cs.CL

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

Pith reviewed 2026-05-17 05:23 UTC · model grok-4.3

keywords: curriculum learning · learning rate decay · LLM pretraining · data quality ordering · model averaging · optimization schedule · mixed-quality data

The pith

Standard learning rate decay largely cancels the benefits of ordering training data by increasing quality in LLM pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that curriculum ordering of data from low to high quality produces clear gains over random shuffling when the learning rate remains constant throughout training. Under the conventional schedules that steadily lower the learning rate, however, those gains largely disappear because the highest-quality data arrives only after the learning rate has already been reduced. The authors show that two straightforward adjustments restore the advantage: using a milder decay that keeps the final learning rate closer to its peak value, or replacing decay entirely with a weighted average of the last few model checkpoints. Combined, these changes yield a 1.64 percent average improvement over random shuffling on standard benchmarks for 1.5-billion-parameter models trained on 30 billion tokens drawn from mixed-quality data.

Core claim

Curriculum-based pretraining in ascending order of data quality substantially outperforms random shuffling when using a constant learning rate, but this advantage diminishes under standard decaying learning rate schedules. The incompatibility can be mitigated by employing a more moderate LR decay schedule where the final LR is only moderately smaller than the peak LR, or by replacing LR decay with model averaging of the final few checkpoints. Combining the two strategies produces a 1.64 percent average benchmark improvement over random shuffling without further data refinement.
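As a rough illustration of the schedules named in the core claim, the sketch below writes constant, cosine, and Warmup-Stable-Decay (WSD) learning rates as plain Python functions. The function names, the linear shape of the WSD decay phase, and every numeric value are assumptions made for illustration; none of this is taken from the authors' code. In this framing a "moderate" decay just means passing a `final_lr` that stays closer to `peak_lr`.

```python
import math


def constant_lr(step, total_steps, peak_lr):
    """Constant schedule: the same learning rate at every step."""
    return peak_lr


def cosine_lr(step, total_steps, peak_lr, final_lr):
    """Cosine decay from peak_lr (first step) to final_lr (last step)."""
    progress = step / max(1, total_steps - 1)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def wsd_lr(step, total_steps, peak_lr, final_lr, warmup_steps, decay_steps):
    """Warmup-Stable-Decay: linear warmup, flat plateau, then a decay phase.

    The decay phase is linear here, which is an assumption of this sketch.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start + 1) / decay_steps
    return peak_lr + (final_lr - peak_lr) * progress


# Illustrative values only (hypothetical, not the paper's configuration).
PEAK_LR = 3e-3
AGGRESSIVE_FINAL_LR = 1e-5   # decays to a very small value
MODERATE_FINAL_LR = 1e-3     # stays within a small factor of the peak
```

Swapping `AGGRESSIVE_FINAL_LR` for `MODERATE_FINAL_LR` is the only change needed to move between the two decay regimes in this sketch.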

What carries the argument

The timing mismatch between high-quality data presentation in an ascending curriculum and the progressive reduction of learning rate, which limits the weight of updates on the best data.
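One crude way to see the timing mismatch is to ask what share of the total learning rate "budget" is spent on the final, highest-quality slice of an ascending curriculum. The sketch below uses the summed learning rate as a proxy for how far the parameters can move, which is a simplifying assumption and not a measurement from the paper. The helper names, the linear decay shape, and the numbers are all hypothetical.

```python
def stable_then_decay(total_steps, peak_lr, final_lr, decay_fraction):
    """Per-step learning rates: flat at peak_lr, then linear decay to final_lr.

    decay_fraction is the share of training spent decaying (0.0 = constant LR).
    """
    decay_steps = round(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    lrs = []
    for step in range(total_steps):
        if step < decay_start:
            lrs.append(peak_lr)
        else:
            progress = (step - decay_start + 1) / decay_steps
            lrs.append(peak_lr + (final_lr - peak_lr) * progress)
    return lrs


def tail_lr_share(lrs, tail_fraction):
    """Fraction of the summed learning rate that falls on the last tail_fraction of steps."""
    n_tail = max(1, round(len(lrs) * tail_fraction))
    return sum(lrs[-n_tail:]) / sum(lrs)


# Hypothetical settings: 1000 steps, top 10% of data seen in the last 10% of steps.
constant = stable_then_decay(1000, 3e-3, 3e-3, 0.0)
moderate = stable_then_decay(1000, 3e-3, 1e-3, 0.37)
aggressive = stable_then_decay(1000, 3e-3, 1e-5, 0.37)

for name, lrs in [("constant", constant), ("moderate", moderate), ("aggressive", aggressive)]:
    print(f"{name:>10}: tail share = {tail_lr_share(lrs, 0.1):.4f}")
```

Under these assumed settings the aggressive schedule gives the last 10% of steps a much smaller share of the summed learning rate than the constant schedule does, which is the mismatch described above in miniature.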

If this is right

Moderate learning rate decay keeps the performance edge of ascending-quality curricula intact.
Model averaging of late checkpoints works as a drop-in replacement for aggressive decay and recovers curriculum benefits.
Data curricula and optimization schedules must be designed together to make full use of high-quality subsets.
Mixed-quality datasets can deliver better results through schedule adjustments alone, without extra curation.
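The checkpoint-averaging idea in the list above can be sketched as a weighted average over the parameters of the last few checkpoints. This is a minimal illustration under stated assumptions: checkpoints are represented as plain dictionaries of float lists, and the helper names `average_checkpoints` and `ema_weights` are hypothetical. The paper's actual weighting scheme and checkpoint spacing are not reproduced here.

```python
def average_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dictionaries with identical keys and shapes.

    checkpoints: list of {param_name: list of floats}, oldest first.
    weights: one non-negative weight per checkpoint; defaults to uniform.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    averaged = {}
    for name, first_values in checkpoints[0].items():
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(first_values))
        ]
    return averaged


def ema_weights(num_checkpoints, decay):
    """Exponentially decaying weights, with the newest checkpoint weighted most."""
    return [decay ** (num_checkpoints - 1 - i) for i in range(num_checkpoints)]


# Toy example with two tiny "checkpoints" (made-up numbers).
ckpts = [{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}]
uniform_avg = average_checkpoints(ckpts)
recent_heavy_avg = average_checkpoints(ckpts, weights=[1.0, 3.0])
```

In a real training run the same arithmetic would be applied to model tensors instead of Python lists; the averaging happens after training at a constant learning rate and stands in for the decay phase.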

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standard learning-rate schedules developed for random data order may systematically undervalue ordered curricula.
Placing higher-quality data earlier or keeping learning rates higher longer could amplify the same effect.
The result suggests similar schedule adjustments may help in other sequential training regimes where data quality varies.

Load-bearing premise

The performance lift from quality-based ordering is limited primarily by when the learning rate is still high rather than by data diversity or model capacity.

What would settle it

Run the same 1.5B model training with a moderate decay schedule. If ascending-quality ordering then produces no measurable gain over random shuffling on the benchmark suite, the central claim is in trouble.

Figures

Figures reproduced from arXiv: 2511.18903 by Chenyi Dang, Haodong Wen, Jiarui Cui, Kaifeng Lyu, Kairong Luo, Wenguang Chen, Xinyu Shi, Zhenbo Sun.

**Figure 1.** Data curriculum strategies are less effective when combined with learning rate (LR) schedules that decay to a low scale near the end. (a-c) Experiments on a 1.5B parameter model trained on 30B tokens compare various data curricula (Uniform, Ascending-Order, and Descending-Order by DCLM score (Li et al., 2024)) under constant, Warmup-Stable-Decay (WSD) (Hu et al., 2024; Hagele et al., 2024), and cosine sche…

**Figure 2.** When varying the decay steps across 37%, 18%, 6% and 0% of training (Long, Mid, Short, Zero, respectively) and ending LRs (1 × 10⁻⁵, 1 × 10⁻³, 2 × 10⁻³, 3 × 10⁻³), the benefit of data curriculum diminishes with more aggressive LR decay. For each LR decay, we train 1.5B-parameter models with uniform and ascending ordering of data based on DCLM scores, and measure the difference in validation loss. As sh…

**Figure 3.** A stage-wise “data folding” curriculum mitigates the negative interaction observed between data ordering and learning rate (LR) decay (detailed in Section 4), but data folding cannot match end-to-end sorting under a constant learning rate. Left: We compare simple ascending curricula (Ascend), sorted by DCLM score, against their “folding” counterparts (Ascend+Folding). The folding method involves partition…

**Figure 4.** Visualization of our intuition about the interplay between data ordering and LR schedules. We assume the gradient update can be decomposed as a signal direction and a noise direction. High-quality data can offer a less noisy direction and a more stable signal direction, while low-quality data can induce a more noisy update. Uniform+Decay, Ascend+Decay and Ascend+EMA represent different training strategies.…

**Figure 5.** This figure compares various training strategies, identifying a high-performing and previously underexplored Optimal Regime where moderate learning rate (LR) decay, weight averaging, and curriculum learning produce synergistic advantages. We run experiments on both Uniform (uniformly ordered data) and Ascend (training data arranged by ascending DCLM scores) data schedules. For both schedules, we conduct a…

**Figure 6.** Visualization of the simulation experiments of the theoretical example. The mean trajectory is averaged over R = 20 runs. The yellow star marks the global optimum, and w1 represents a signal direction and w2 represents a noise direction. The data samples are distributed evenly along the signal direction and randomly located along the noise direction. Ascend+WSMD and Ascend+EMA win by sufficient progress al…

**Figure 7.** Downstream task scores and validation losses show high correlation according to the Pearson correlation coefficient (r) and R-squared value (R²). Average is the average score of all 8 downstream tasks and Core is the average score of the first 4 downstream tasks (MMLU, ARC-c/e, CSQA) in Tables 1 and 2. 1-sqrt decay function, $\eta(t) = \eta_0 \left(1 - \sqrt{r(t)}\right) + \eta_T \sqrt{r(t)}$, and the sqrt-cube function, $\eta(t) =$ …
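The 1-sqrt decay mentioned in the Figure 7 caption can be sketched as a small function. This reading interpolates between the peak rate η0 and the final rate ηT with weight √r(t); it also treats r(t) as the fraction of the decay phase completed, running from 0 to 1, which is an assumption since the caption does not define it. The function name is hypothetical.

```python
import math


def one_minus_sqrt_lr(r, peak_lr, final_lr):
    """1-sqrt decay: eta = peak_lr * (1 - sqrt(r)) + final_lr * sqrt(r).

    r is assumed to be the completed fraction of the decay phase, in [0, 1].
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    root = math.sqrt(r)
    return peak_lr * (1.0 - root) + final_lr * root


# Illustrative values only: the rate drops quickly at first, then flattens out.
for r in (0.0, 0.25, 1.0):
    print(r, one_minus_sqrt_lr(r, peak_lr=3e-3, final_lr=1e-3))
```

Because √r rises fastest near r = 0, this shape spends most of the decay phase close to the final rate, so the choice of `final_lr` matters more than it would under a linear decay.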

**Figure 8.** The benefits of a data curriculum using PreSelect scores also diminish. We show the validation loss curves for constant and WSD LR schedules under different data schedules, including uniform, ascending, and descending orders by PreSelect scores. Overall, the ascending curriculum outperforms the uniform baseline under a constant schedule, but cannot match it under the WSD LR schedule. The final validation l…

Original abstract

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that standard LR decay undercuts curriculum gains in LLM pretraining because high-quality data arrives too late, and milder decay or checkpoint averaging recovers most of the benefit.

The punchline is that curriculum-based pretraining for LLMs gets held back by the standard decaying learning rate schedule. The high-quality data comes last, when the learning rate is already low, so the model doesn't learn as much from it as it could.

What the paper does well is lay out a clear comparison. They show that ordering data by quality beats random shuffling when the learning rate stays constant. But under the usual cosine decay, that advantage mostly goes away. Then they try two fixes: a more moderate decay that keeps the final rate higher, and using model averaging on the last checkpoints instead of decaying. On 1.5B parameter models trained for 30B tokens, these changes lead to a 1.64% better average on benchmarks across a few different quality metrics. No extra data work needed.

This interaction between ordering and scheduler hasn't been called out this directly before, so that's the new part. It makes sense to think about co-designing the two.

The evidence is a bit indirect though. The paper compares final scores but doesn't show measurements of how well the late data is actually learned at low learning rates. Other things about the training trajectory might explain the differences. They also don't report variance across runs or give full details on the baseline setups, which leaves some uncertainty about how big and reliable the improvements are. The quality metric's role versus other data properties like diversity isn't fully separated out.

This is useful for anyone doing large-scale LLM pretraining with mixed data qualities. Engineers looking for small tweaks to get more out of their compute would find it practical. It has enough going for it to go to a serious referee, though the experiments could use more statistical rigor. I'd recommend putting it through peer review. The core finding is worth testing further in the community.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that ascending-quality curriculum learning for LLM pretraining is incompatible with standard decaying learning-rate schedules. While curriculum ordering substantially outperforms random shuffling under constant LR, the advantage largely vanishes under cosine decay because high-quality data arrives late when the LR is small. The authors propose two mitigations—moderate final LR and model averaging of late checkpoints—and report a 1.64% average benchmark improvement on 1.5B-parameter models trained for 30B tokens across multiple quality metrics.

Significance. If the central empirical finding holds, the work offers a practical explanation for why prior curriculum-based pretraining studies have shown only modest gains and supplies simple, zero-cost fixes that improve data utilization. The validation on 1.5B models with 30B tokens and several quality metrics supplies a concrete, reproducible basis for re-examining the co-design of data order and optimization schedules in large-scale training.

major comments (2)

[Abstract and §4] The evidence for the claim that LR decay specifically wastes high-quality data rests on the performance gap between constant-LR and decaying-LR conditions (Abstract and §4). This comparison does not isolate whether the high-quality tail is under-learned at low LR; final benchmark deltas could arise from other interactions between ordering and optimization trajectory. Direct measurements such as per-quality-subset loss curves or gradient statistics at late training stages would strengthen the causal link.
[Abstract and Results] The reported 1.64% average improvement and the claim that curriculum advantage “diminishes” under standard decay lack run-to-run variance, number of seeds, or statistical tests. Without these, it is difficult to judge whether the observed differences are robust or could be explained by optimization stochasticity alone.

minor comments (2)

[§3] Clarify the exact implementation of the quality metrics and the precise definition of the ascending curriculum (e.g., how ties or batch construction are handled) to aid reproducibility.
[Figures in §4] Figure captions and axis labels should explicitly state the LR schedule (constant vs. cosine) and the data-ordering condition for each curve.
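The reproducibility point about ties and batch construction can be made concrete with a small sketch of one possible ascending curriculum. The tie-breaking rule here (a seeded shuffle followed by a stable sort) and the contiguous batching are assumptions chosen for illustration; the manuscript's actual procedure is exactly what the referee asks the authors to specify. All names are hypothetical.

```python
import random


def ascending_curriculum(doc_ids, scores, seed=0):
    """Order documents by ascending quality score.

    Ties are broken by a seeded shuffle, so the order is random among
    equal-score documents but reproducible for a given seed.
    """
    if len(doc_ids) != len(scores):
        raise ValueError("doc_ids and scores must have the same length")
    rng = random.Random(seed)
    order = list(range(len(doc_ids)))
    rng.shuffle(order)                    # randomize first ...
    order.sort(key=lambda i: scores[i])   # ... then stable-sort by score
    return [doc_ids[i] for i in order]


def contiguous_batches(items, batch_size):
    """Cut the ordered stream into consecutive batches (last one may be short)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Made-up documents and scores; "b" and "d" tie on the lowest score.
docs = ["a", "b", "c", "d"]
quality = [0.9, 0.1, 0.5, 0.1]
ordered = ascending_curriculum(docs, quality, seed=0)
batched = contiguous_batches(ordered, batch_size=3)
```

Stating the seed and the tie rule is enough to make the data order exactly repeatable, which is the property the referee's comment is after.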

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of our empirical results.

Referee: [Abstract and §4] The evidence for the claim that LR decay specifically wastes high-quality data rests on the performance gap between constant-LR and decaying-LR conditions (Abstract and §4). This comparison does not isolate whether the high-quality tail is under-learned at low LR; final benchmark deltas could arise from other interactions between ordering and optimization trajectory. Direct measurements such as per-quality-subset loss curves or gradient statistics at late training stages would strengthen the causal link.

Authors: The experimental design holds all factors fixed except data ordering and LR schedule, allowing direct comparison of curriculum versus random ordering under constant LR (where curriculum yields large gains) versus decaying LR (where gains largely vanish). This differential outcome isolates the interaction between ascending quality order and LR decay. We agree that more granular measurements would further support the interpretation. In the revision we will add per-quality-subset loss curves and late-stage gradient statistics to provide direct evidence that high-quality data receives insufficient optimization under standard decay.
Revision: yes

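The per-quality-subset loss measurement discussed in this exchange could be computed roughly as follows: sort held-out examples by quality score, cut them into equal-size buckets, and average the loss within each bucket. This is a sketch under stated assumptions, not the authors' evaluation code. The function name, the equal-count bucketing, and the sample numbers are all hypothetical.

```python
def loss_by_quality_bucket(scores, losses, num_buckets=4):
    """Mean loss per quality bucket, lowest-quality bucket first.

    scores: quality score per example; losses: per-example loss at one checkpoint.
    Buckets hold (nearly) equal numbers of examples after sorting by score.
    """
    if len(scores) != len(losses) or not scores:
        raise ValueError("scores and losses must be non-empty and the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    means = []
    for b in range(num_buckets):
        idx = order[b * n // num_buckets:(b + 1) * n // num_buckets]
        means.append(sum(losses[i] for i in idx) / len(idx) if idx else float("nan"))
    return means


# Made-up numbers purely to show the call shape.
example_scores = [0.1, 0.9, 0.5, 0.3]
example_losses = [4.0, 2.0, 3.0, 3.5]
bucket_means = loss_by_quality_bucket(example_scores, example_losses, num_buckets=2)
```

Repeating this at each saved checkpoint gives one loss curve per bucket, and comparing the top bucket's curve across LR schedules is the kind of direct evidence the referee requests.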
Referee: [Abstract and Results] The reported 1.64% average improvement and the claim that curriculum advantage “diminishes” under standard decay lack run-to-run variance, number of seeds, or statistical tests. Without these, it is difficult to judge whether the observed differences are robust or could be explained by optimization stochasticity alone.

Authors: We agree that variance reporting is essential for assessing robustness. All reported results were obtained from multiple independent training runs using different random seeds. In the revised manuscript we will explicitly state the number of seeds, report means with standard deviations (or error bars) for the benchmark averages including the 1.64% gain, and include appropriate statistical comparisons to confirm that the differences exceed run-to-run variability.
Revision: yes
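The seed-level reporting promised here might look like the sketch below: a mean and sample standard deviation per condition, plus Welch's t statistic for comparing two conditions with unequal variances. Welch's test is one reasonable choice and is an assumption of this sketch; the rebuttal does not name a specific test. The scores are placeholder numbers, not results from the paper.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df


# Placeholder benchmark averages for three seeds per condition (made up).
curriculum_runs = [2.0, 4.0, 6.0]
shuffled_runs = [1.0, 2.0, 3.0]

mean_c, std_c = summarize(curriculum_runs)
t_stat, dof = welch_t(curriculum_runs, shuffled_runs)
```

The t statistic and degrees of freedom can then be turned into a p-value with any standard statistics package; with only a handful of seeds, reporting the raw per-seed scores alongside is usually more informative than the test alone.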

Circularity Check

0 steps flagged

Purely empirical study with no circular derivation chain

The paper reports direct experimental comparisons of curriculum ordering versus random shuffling on 1.5B models trained for 30B tokens, using multiple quality metrics and benchmark suites. All central claims rest on measured performance deltas under constant LR versus cosine decay, plus two mitigation strategies (moderate decay and model averaging). No equations, fitted parameters, uniqueness theorems, or ansatzes are invoked; results are externally falsifiable via replication on the stated data and schedules. The work contains no derivation that reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the paper reports empirical observations and simple schedule changes rather than any theoretical derivation.
