CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

arXiv: 2601.05858 · v2 · submitted 2026-01-09 · cs.CL · cs.AI · cs.LG

Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu

Pith reviewed 2026-05-16 15:52 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords curriculum learning · preference optimization · machine translation · catastrophic forgetting · large language models · CLewR

The pith

Repeating an easy-to-hard curriculum multiple times during preference optimization improves machine translation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the sequence in which training examples are presented matters for preference optimization in machine translation. Standard single-pass curricula risk forgetting easy cases once harder ones dominate, so the authors add restarts that cycle back through the same ordering. This change produces measurable gains when plugged into existing preference methods and tested on several model families. A reader would care because the modification is lightweight yet addresses a known weakness in how alignment-style training handles data difficulty.

Core claim

The central discovery is that a curriculum learning strategy with restarts (CLewR) improves multilingual machine translation when added to preference optimization. By cycling the easy-to-hard ordering repeatedly instead of once, the method reduces catastrophic forgetting of simpler examples while still progressing to harder ones, yielding consistent performance lifts across Gemma2, Qwen2.5, and Llama3.1 models and across multiple preference optimization algorithms.

What carries the argument

CLewR, the curriculum learning strategy with restarts that applies the easy-to-hard sample ordering several times during a single training run to keep easy examples from being forgotten.
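
The ordering described above can be illustrated with a minimal sketch. It assumes that each example already has a scalar difficulty score and that a restart simply replays the same easy-to-hard order. The function name `curriculum_with_restarts`, the `difficulty` callable, and the toy scores are hypothetical and not taken from the paper or its released code.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_with_restarts(
    examples: Sequence[T],
    difficulty: Callable[[T], float],
    num_cycles: int,
) -> List[T]:
    """Return a training order: easy-to-hard, replayed num_cycles times."""
    if num_cycles < 1:
        raise ValueError("num_cycles must be at least 1")
    # sorted() is stable, so examples with equal scores keep their input order.
    easy_to_hard = sorted(examples, key=difficulty)
    order: List[T] = []
    for _ in range(num_cycles):
        # Restart: go back to the easiest example and walk up again.
        order.extend(easy_to_hard)
    return order


# Toy data: (source sentence, hypothetical difficulty score).
toy = [("long hard sentence", 0.9), ("hi", 0.1), ("medium one", 0.5)]
schedule = curriculum_with_restarts(toy, difficulty=lambda ex: ex[1], num_cycles=2)
print([text for text, _ in schedule])
# ['hi', 'medium one', 'long hard sentence', 'hi', 'medium one', 'long hard sentence']
```

With `num_cycles=1` the sketch reduces to a standard single-pass curriculum, which makes the restart count the only thing that changes between the two regimes.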

If this is right

Preference optimization pipelines for machine translation can be strengthened by controlling data presentation order rather than relying on random shuffling alone.
Restarting the curriculum preserves gains on simpler translation directions while models continue to improve on harder ones.
The same restart pattern produces benefits across different large language model families and different preference optimization algorithms.
The approach adds little computational overhead and requires no new model architectures.
Releasing the code allows direct integration into other multilingual alignment workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same restart mechanism could be tested in other continual-learning settings where early-acquired capabilities are lost, such as instruction tuning or vision-language tasks.
Future experiments might vary the number of restarts as a function of model scale or dataset size to see whether an optimal cycle count exists.
The results imply that sample ordering is an under-explored lever in preference learning, worth comparing against other scheduling ideas such as difficulty-aware batching.

Load-bearing premise

That cycling the curriculum will reliably protect performance on easy examples without creating new instabilities or forcing extensive retuning for each model and dataset.

What would settle it

Training the same models and datasets with the identical preference optimization setup but without any restarts, then measuring whether accuracy on easy translation pairs drops at the same rate and overall MT quality shows no net gain.
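
One way to set up such a comparison is to build two schedules that visit every example the same number of times and differ only in order. The sketch below is an assumption about how a step-matched control could be built, not the paper's protocol. The helper names and the choice to stretch the single pass by repeating each example in place are hypothetical.

```python
from typing import List, Sequence


def easy_to_hard(difficulty: Sequence[float]) -> List[int]:
    """Indices of the examples sorted from easiest to hardest."""
    return sorted(range(len(difficulty)), key=lambda i: difficulty[i])


def restart_schedule(difficulty: Sequence[float], num_cycles: int) -> List[int]:
    """Restarted curriculum: replay the full easy-to-hard order num_cycles times."""
    return easy_to_hard(difficulty) * num_cycles


def single_pass_schedule(difficulty: Sequence[float], num_cycles: int) -> List[int]:
    """Step-matched control: one easy-to-hard pass, each example seen num_cycles times in a row."""
    return [i for i in easy_to_hard(difficulty) for _ in range(num_cycles)]


difficulty = [0.7, 0.1, 0.4]  # hypothetical scores for three examples
with_restarts = restart_schedule(difficulty, num_cycles=2)
single_pass = single_pass_schedule(difficulty, num_cycles=2)
print(with_restarts)  # [1, 2, 0, 1, 2, 0]
print(single_pass)    # [1, 1, 2, 2, 0, 0]
```

Both schedules have the same length and the same per-example counts, so a difference in final quality between them could not be explained by the number of optimization steps alone.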

Original abstract

Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra-dragomir/CLewR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLewR adds restarts to an easy-to-hard curriculum in MT preference optimization to limit forgetting of simple examples, but the experiments do not cleanly separate that mechanism from longer training runs.

The main thing here is a scheduling change for preference optimization on machine translation data: sort examples by difficulty, train easy to hard, then restart the curriculum from the beginning to keep easy examples from being forgotten. They apply this to Gemma2, Qwen2.5, and Llama3.1 with several preference methods and report consistent gains. The code is released, which helps anyone who wants to check the implementation details.

What is actually new is the explicit restart loop inside the curriculum; prior MT preference papers have not focused on cycling back to easy data this way. The approach is lightweight and can be dropped into existing pipelines without new loss functions or model changes, which is a practical plus. The multi-model testing gives the results a bit more weight than a single-model study would.

The soft spot is that the paper does not isolate whether the restarts themselves drive the improvement. Without an ablation that matches total training steps between a single-pass curriculum and the restarted version, or without tracking accuracy on the easiest examples across cycles, the gains could simply reflect more optimization steps or tuning effects rather than better retention. The stress-test note captures this accurately.

This is useful for people already running preference tuning on translation data who want a simple ordering trick to try. A reader working on multilingual LLMs or DPO-style methods would get the most out of it. I would send it to peer review because the core idea is straightforward, the evaluation covers several models, and referees can ask for the missing controls if the numbers look promising.

Referee Report

3 major / 2 minor

Summary. The paper proposes CLewR, a curriculum-learning-with-restarts schedule integrated into preference optimization (DPO, IPO, etc.) for multilingual machine translation. The core idea is to repeat easy-to-hard passes over the preference data multiple times during training so that easy examples are not catastrophically forgotten. The authors report consistent BLEU and COMET gains on several model families (Gemma2, Qwen2.5, Llama3.1) and publicly release the code.

Significance. If the restart schedule is shown to be the causal driver of retention rather than simply longer optimization, the method offers a lightweight, hyper-parameter-light way to stabilize curriculum-based preference tuning for MT. The code release supports reproducibility and allows the community to test the schedule on new models and datasets.

major comments (3)

[§4] §4 (Experiments): the paper reports aggregate gains but does not isolate the restart mechanism. No ablation compares (a) single-pass curriculum with the same total number of steps against (b) the restarted CLewR schedule, nor is per-cycle accuracy on the easiest difficulty quartile tracked to demonstrate retention of easy examples.
[§3.2] §3.2 (CLewR description): the restart interval and number of cycles are presented as fixed hyper-parameters without a sensitivity study; it is therefore unclear whether the reported improvements generalize or require per-model retuning of the restart schedule.
[Table 2] Table 2 / Figure 3: the baseline preference-optimization runs appear to use the original learning-rate schedule; any interaction between the curriculum restarts and the optimizer’s LR decay is not controlled, leaving open the possibility that gains arise from the effective training length rather than the forgetting-mitigation claim.
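
The retention measurement requested in the first major comment could be sketched as follows. This is an illustration under stated assumptions: "accuracy" is taken to mean a per-example boolean, for instance whether the model ranks the preferred translation above the rejected one, and the helper names and toy numbers are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence


def easiest_quartile(difficulty: Sequence[float]) -> List[int]:
    """Indices of the easiest 25% of examples (at least one example)."""
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    return order[: max(1, len(order) // 4)]


def retention_by_cycle(
    per_cycle_correct: Sequence[Sequence[bool]],
    difficulty: Sequence[float],
) -> List[float]:
    """Accuracy on the easiest quartile, measured once after each cycle.

    per_cycle_correct[c][i] is True if example i is handled correctly
    when evaluated at the end of cycle c.
    """
    easy = easiest_quartile(difficulty)
    return [sum(correct[i] for i in easy) / len(easy) for correct in per_cycle_correct]


# Hypothetical difficulty scores and evaluation results for eight examples.
difficulty = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3, 0.7, 0.6]
per_cycle_correct = [
    [False, True, True, True, False, True, False, False],  # after cycle 0
    [True, True, True, False, True, True, True, True],     # after cycle 1
]
print(retention_by_cycle(per_cycle_correct, difficulty))  # [1.0, 0.5]
```

In this toy run the easiest quartile is examples 1 and 3, and accuracy on it falls from 1.0 to 0.5 even though overall accuracy rises. That is the forgetting pattern the referee asks the authors to rule out.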

minor comments (2)

[Abstract] The abstract states “consistent gains” without any numerical values; a one-sentence summary of the largest observed improvement (e.g., “+1.2 BLEU on average”) would help readers gauge magnitude before reading the full experimental section.
[Eq. 3] Notation for difficulty scoring (Eq. 3) is introduced without an explicit statement of how the score is normalized across language pairs; this makes it hard to reproduce the exact curriculum ordering.
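
To make the normalization concern concrete, the sketch below standardizes difficulty scores within each language pair before any global sort. It is only one possible scheme; the paper's Eq. 3 is not reproduced here, and the function name, the z-score choice, and the example language pairs are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def normalize_per_pair(scores: Sequence[float], pairs: Sequence[str]) -> List[float]:
    """Z-score each difficulty score within its own language pair."""
    by_pair: Dict[str, List[float]] = defaultdict(list)
    for score, pair in zip(scores, pairs):
        by_pair[pair].append(score)
    # Population mean and standard deviation per language pair.
    stats: Dict[str, Tuple[float, float]] = {
        pair: (mean(values), pstdev(values)) for pair, values in by_pair.items()
    }
    normalized = []
    for score, pair in zip(scores, pairs):
        mu, sigma = stats[pair]
        # A pair with no spread gets 0.0 to avoid dividing by zero.
        normalized.append((score - mu) / sigma if sigma > 0 else 0.0)
    return normalized


# Hypothetical raw scores on very different scales for two language pairs.
scores = [1.0, 3.0, 10.0, 30.0]
pairs = ["en-de", "en-de", "en-ro", "en-ro"]
print(normalize_per_pair(scores, pairs))  # [-1.0, 1.0, -1.0, 1.0]
```

Without a step like this, a global easy-to-hard sort would place every example of the lower-scoring pair before any example of the other pair. The curriculum would then order by language pair, not by difficulty.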

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We address each major point below and will incorporate revisions to strengthen the experimental analysis.

Referee: [§4] §4 (Experiments): the paper reports aggregate gains but does not isolate the restart mechanism. No ablation compares (a) single-pass curriculum with the same total number of steps against (b) the restarted CLewR schedule, nor is per-cycle accuracy on the easiest difficulty quartile tracked to demonstrate retention of easy examples.

Authors: We agree that isolating the restart mechanism is valuable. In the revised manuscript we will add an ablation that trains a single-pass curriculum for exactly the same total number of steps as CLewR, and we will report per-cycle accuracy on the easiest difficulty quartile to directly demonstrate retention of simple examples. revision: yes
Referee: [§3.2] §3.2 (CLewR description): the restart interval and number of cycles are presented as fixed hyper-parameters without a sensitivity study; it is therefore unclear whether the reported improvements generalize or require per-model retuning of the restart schedule.

Authors: We acknowledge the absence of a sensitivity study. The revised version will include a sensitivity analysis over restart interval and number of cycles (reported in the appendix), showing that performance gains remain stable across a practical range of these values without requiring extensive per-model retuning. revision: yes
Referee: [Table 2] Table 2 / Figure 3: the baseline preference-optimization runs appear to use the original learning-rate schedule; any interaction between the curriculum restarts and the optimizer’s LR decay is not controlled, leaving open the possibility that gains arise from the effective training length rather than the forgetting-mitigation claim.

Authors: All runs (baselines and CLewR) use the identical total optimization budget and the same cosine learning-rate schedule applied over the full training duration. We will explicitly state this equivalence in the revised text and add a clarifying note that any observed gains cannot be attributed to longer training. revision: yes
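
The setup described in the last response, a single cosine schedule spanning the whole run while the data order restarts, can be sketched as below. The function `cosine_lr`, the absence of warmup, and all numeric values are assumptions for illustration; the paper's actual hyperparameters are not stated on this page.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr over total_steps, with no warmup."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 3 curriculum cycles of 100 steps each, one LR schedule for all 300 steps.
steps_per_cycle = 100
num_cycles = 3
total_steps = steps_per_cycle * num_cycles

# Learning rate at the first step of each cycle.
lr_at_restart = [
    cosine_lr(cycle * steps_per_cycle, total_steps, base_lr=1e-5)
    for cycle in range(num_cycles)
]
for cycle, lr in enumerate(lr_at_restart):
    print(f"cycle {cycle} starts at lr={lr:.2e}")
```

Here the learning rate at the three cycle starts is 1e-5, about 7.5e-6, and about 2.5e-6, so the data order restarts but the optimizer schedule does not. Easy examples are therefore revisited at progressively smaller step sizes, an interaction that a controlled comparison would need to account for.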

Circularity Check

0 steps flagged

No circularity: algorithmic schedule with no derivations or self-referential reductions

The paper introduces CLewR as an empirical algorithmic schedule (repeated easy-to-hard passes with restarts) for preference optimization, evaluated across model families and techniques. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The central claim rests on experimental gains rather than any closed-form result that reduces to its own inputs by construction, satisfying the criteria for a self-contained non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical axioms, free parameters, or invented entities; it relies on standard assumptions of curriculum learning and preference optimization already present in the literature.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8 / flipAt512
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "CLewR... which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples."

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "the easy-to-hard data permutation is reused at every epoch"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

XekRung Technical Report
cs.CR · 2026-04 · unverdicted · novelty 3.0

XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.