Overtrained language models are harder to fine-tune
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2026 (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
  Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
  Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
- Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
  Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.