Following (Springer et al., 2025), we have the following initialization: Assumption C.8 (Pretrained Initialization Scale). Let (W_1(0), W_2(0)) be the parameters at initialization

Next, we will characterize the initialization scale of the model before pretraining · 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

cs.LG · 2026-05-12 · conditional · novelty 6.0

Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

Early Data Exposure Improves Robustness to Subsequent Fine-Tuning · cs.LG · 2026-05-12 · conditional · none · ref 24