Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
Following Springer et al. (2025), we have the following initialization: Assumption C.8 (Pretrained Initialization Scale). Let $(W_1(0), W_2(0))$ be the parameters at initialization
1 Pith paper cites this work.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
Early Data Exposure Improves Robustness to Subsequent Fine-Tuning