pith. sign in

arxiv: 2605.26484 · v1 · pith:CND7DI2Dnew · submitted 2026-05-26 · 💻 cs.LG

Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

classification 💻 cs.LG
keywords extra-mergemergingmodelsubspaceemphlanguagepre-trainingrank-1
0
0 comments X
read the original abstract

Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a \textbf{Rank-1 Subspace} phenomenon: while raw optimization steps oscillate violently, consecutive \emph{merged} checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a \emph{river-valley} landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose \textbf{Extra-Merge}, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer \citep{jordan2024muon}.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.