Overtrained language models are harder to fine-tune

URL https://arxiv · 1929 · arXiv 2503.19206

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

cs.LG · 2026-05-12 · conditional · novelty 6.0

Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.

citing papers explorer

Showing 3 of 3 citing papers.

Early Data Exposure Improves Robustness to Subsequent Fine-Tuning · cs.LG · 2026-05-12 · conditional · none · ref 19
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less · cs.LG · 2026-05-07 · unverdicted · none · ref 26
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima · cs.LG · 2026-04-10 · unverdicted · none · ref 37
