Merging Text Transformer Models from Different Initializations

Maha Elbayad; Neha Verma

arxiv: 2403.00986 · v3 · pith:YQ2P4H6Cnew · submitted 2024-03-01 · cs.CL · cs.AI · cs.LG

classification: cs.CL, cs.AI, cs.LG

keywords: models, merging, minima, model, transformer, work, architecture, different

Abstract

Recent work on permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations. However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain. Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging, across models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.
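
The two ideas the abstract relies on, permutations that stay within a functional equivalence class and the loss barrier between two minima, can be illustrated with a small sketch. This is not the authors' method: it uses a toy two-layer ReLU feed-forward block rather than a Transformer, and the names `mlp`, `permute_hidden`, and `loss_barrier` are hypothetical helpers introduced here. The barrier follows one common definition, assumed rather than taken from the paper: the largest gap between the loss along the linear path and the linear interpolation of the two endpoint losses.

```python
import numpy as np


def mlp(x, w1, b1, w2):
    """Toy two-layer ReLU feed-forward block: relu(x @ w1 + b1) @ w2."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2


def permute_hidden(w1, b1, w2, perm):
    """Reorder the hidden units.

    Columns of w1, entries of b1, and rows of w2 are permuted together,
    so the permuted block computes the same function as the original.
    """
    return w1[:, perm], b1[perm], w2[perm, :]


def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess loss on the straight line between two parameter vectors.

    Assumed definition: max over lam in [0, 1] of
    loss((1 - lam) * a + lam * b) - ((1 - lam) * loss(a) + lam * loss(b)).
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    gaps = []
    for lam in np.linspace(0.0, 1.0, num_points):
        theta = (1.0 - lam) * theta_a + lam * theta_b
        gaps.append(loss_fn(theta) - ((1.0 - lam) * loss_a + lam * loss_b))
    return max(gaps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3))
    w1, b1, w2 = rng.normal(size=(3, 5)), rng.normal(size=5), rng.normal(size=(5, 2))

    # Permuting hidden units changes the weights but not the outputs.
    permuted = permute_hidden(w1, b1, w2, rng.permutation(5))
    print(np.allclose(mlp(x, w1, b1, w2), mlp(x, *permuted)))
```

In a permutation-based merge, one model's units would be reordered to line up with the other's before the weights are averaged, and the barrier would then be measured on the task loss. How that alignment is computed for residual connections and multi-headed attention is the subject of the paper and is not reproduced here.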

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs
cs.CR 2026-06 unverdicted novelty 7.0

Tiered Language Models use a secret key to induce an alternative computation graph over shared weights, enabling private capabilities in the keyed mode while the public mode shows none.

Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
cs.LG 2026-06 unverdicted novelty 5.0

A bidirectional optimization method using parameterized transformations enables near-zero loss barriers for linear mode connectivity in medium-scale language models and small barriers in billion-parameter transformers.