arxiv: 2606.23607 · v1 · pith:VTR57UYCnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

Pith reviewed 2026-06-26 08:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords linear mode connectivity · model merging · pretrained transformers · weight transformations · dual optimization · loss barriers · large language models · vision transformers

The pith

Bidirectional optimization of functionality-preserving weight transformations enables near-zero loss barriers when linearly interpolating large pretrained transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that independently trained large transformers can be aligned so that a straight line in weight space connects their parameters without a large performance drop. Both models simultaneously learn transformations that preserve their own function, rather than only one side being adjusted. The result is a shared linear path along which loss or accuracy stays nearly constant. For medium-sized language models the loss barriers on WikiText drop to near zero, billion-parameter LLMs show only small remaining barriers, and ViT-L stays above 69% ImageNet top-1 accuracy throughout the path. If correct, the result means model merging at scale can rely on plain linear averaging once parameter symmetries are resolved.
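To make the term "loss barrier" concrete, here is a minimal sketch that measures it along a straight line between two parameter vectors. It assumes the common definition (largest excess of the path loss over the linear baseline between the two endpoint losses); the paper's exact definition is not given in the abstract, and `toy_loss`, `interpolate`, and `loss_barrier` are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], alpha: float) -> List[float]:
    """Return the point (1 - alpha) * theta_a + alpha * theta_b on the straight line."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(
    loss_fn: Callable[[Sequence[float]], float],
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    num_points: int = 11,
) -> float:
    """Largest excess of the path loss over the linear baseline between endpoint losses."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    worst = 0.0
    for i in range(num_points):
        alpha = i / (num_points - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        worst = max(worst, path_loss - baseline)
    return worst


def toy_loss(theta: Sequence[float]) -> float:
    """Hypothetical one-parameter loss with two minima, at x = -1 and x = +1."""
    x = theta[0]
    return (x * x - 1.0) ** 2


# Both endpoints sit in a minimum, but the straight line crosses the bump at x = 0.
barrier = loss_barrier(toy_loss, [-1.0], [1.0])
print(f"barrier between the two minima: {barrier:.3f}")
```

A barrier near zero is what the summary above calls near-barrier-free connectivity. In practice `loss_fn` would evaluate a model on held-out data, which this toy does not attempt.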

Core claim

Properly parameterized functionality-preserving weight transformations combined with a dual learning procedure, in which both endpoint models jointly optimize their transformations toward one shared linear interpolation path, substantially reduce interpolation barriers and enable reliable merging of billion-parameter pretrained transformers.

What carries the argument

The dual learning procedure that jointly optimizes bidirectional functionality-preserving weight transformations to produce a shared linear interpolation path.
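One plausible way to write the dual objective, offered only as a sketch: the abstract does not state the paper's actual loss, so the symbols $T_{\phi}$, $\phi_A$, $\phi_B$, and the uniform sampling of $\alpha$ are assumptions made here for illustration.

```latex
% Assumed notation: f(x; \theta) is the network, T_\phi a transformation family
% that preserves the function for every setting of its parameters \phi.
\[
  f\bigl(x;\, T_{\phi}(\theta)\bigr) = f(x;\, \theta)
  \quad \text{for all } x \text{ and all } \phi .
\]
% Dual (bidirectional) variant: both endpoints are transformed and optimized jointly.
\[
  \min_{\phi_A,\, \phi_B}\;
  \mathbb{E}_{\alpha \sim \mathcal{U}[0,1]}\,
  \mathcal{L}\Bigl( (1-\alpha)\, T_{\phi_A}(\theta_A) + \alpha\, T_{\phi_B}(\theta_B) \Bigr)
\]
```

Under this reading, the one-sided methods that the abstract contrasts against would correspond to fixing $T_{\phi_A}$ to the identity and optimizing $\phi_B$ alone.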

If this is right

  • Linear interpolation becomes a practical route for merging large language models once symmetries are aligned.
  • The same alignment procedure scales from medium-sized transformers to billion-parameter models with only small remaining barriers.
  • Vision transformers such as ViT-L maintain high accuracy across the entire linear path after the transformations are learned.
  • Model merging no longer requires searching for complex nonlinear paths when the dual procedure is applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many independently trained large models may share latent functional equivalences that linear paths can capture once symmetries are removed.
  • The approach could be tested on other generative architectures such as diffusion models to check whether the same bidirectional alignment reduces barriers there.
  • If the transformations remain stable under continued training, the method might support incremental merging of models fine-tuned on successive tasks.

Load-bearing premise

Functionality-preserving weight transformations exist that can be jointly optimized from both models without lowering the performance of either original network.
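The simplest known example of a functionality-preserving weight transformation is a permutation of hidden units. The sketch below checks it on a toy two-layer ReLU network. It illustrates the premise only; the paper's parameterization for transformers is not described in the abstract, and `mlp_forward` and `permute_hidden` are hypothetical helpers.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mlp_forward(w1: Matrix, b1: Sequence[float], w2: Matrix, x: Sequence[float]) -> List[float]:
    """Two-layer network: w2 @ relu(w1 @ x + b1)."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def permute_hidden(
    w1: Matrix, b1: Sequence[float], w2: Matrix, perm: Sequence[int]
) -> Tuple[Matrix, List[float], Matrix]:
    """Reorder hidden units: rows of w1 and entries of b1, plus matching columns of w2."""
    new_w1 = [list(w1[p]) for p in perm]
    new_b1 = [b1[p] for p in perm]
    new_w2 = [[row[p] for p in perm] for row in w2]
    return new_w1, new_b1, new_w2


# Arbitrary illustrative weights (3 hidden units, 2 inputs, 1 output).
w1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.1, -0.2, 0.3]
w2 = [[2.0, -1.0, 0.5]]

pw1, pb1, pw2 = permute_hidden(w1, b1, w2, perm=[2, 0, 1])

inputs = [[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]]
max_gap = max(
    abs(a - b)
    for x in inputs
    for a, b in zip(mlp_forward(w1, b1, w2, x), mlp_forward(pw1, pb1, pw2, x))
)
print(f"weights changed: {pw1 != w1}, largest output gap: {max_gap:.2e}")
```

The weights move to a different point in parameter space while the outputs agree up to floating-point rounding. The premise is that the transformations learned for both models behave this way.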

What would settle it

If the bidirectional optimization is run on a specific billion-parameter LLM and the measured loss barrier remains as large as in the unaligned case, the method does not deliver the claimed reduction.
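The comparison described here, aligned versus unaligned barrier, can be sketched on a toy where the answer is known. Two copies of the same tiny ReLU network differ only by a swap of hidden units. Everything below is a hypothetical illustration, not the paper's models or procedure, and the alignment is a single hand-written swap applied to one endpoint rather than learned transformations on both.

```python
from typing import List, Sequence


def net(params: Sequence[float], x: float) -> float:
    """Tiny network with two hidden ReLU units: v1 * relu(w1 * x) + v2 * relu(w2 * x)."""
    w1, w2, v1, v2 = params
    return v1 * max(0.0, w1 * x) + v2 * max(0.0, w2 * x)


def loss(params: Sequence[float]) -> float:
    """Mean squared error against the target |x| on four fixed inputs."""
    xs = [-2.0, -1.0, 1.0, 2.0]
    return sum((net(params, x) - abs(x)) ** 2 for x in xs) / len(xs)


def barrier(theta_a: Sequence[float], theta_b: Sequence[float], num_points: int = 11) -> float:
    """Largest excess of the path loss over the linear baseline between endpoint losses."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    worst = 0.0
    for i in range(num_points):
        alpha = i / (num_points - 1)
        point = [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]
        worst = max(worst, loss(point) - ((1.0 - alpha) * loss_a + alpha * loss_b))
    return worst


def swap_hidden_units(params: Sequence[float]) -> List[float]:
    """Function-preserving transformation: exchange the two hidden units."""
    w1, w2, v1, v2 = params
    return [w2, w1, v2, v1]


model_a = [1.0, -1.0, 1.0, 1.0]   # computes |x|
model_b = [-1.0, 1.0, 1.0, 1.0]   # also computes |x|, hidden units in the other order

unaligned = barrier(model_a, model_b)
aligned = barrier(model_a, swap_hidden_units(model_b))
print(f"unaligned barrier: {unaligned:.3f}, aligned barrier: {aligned:.3f}")
```

In this toy the unaligned midpoint averages the first-layer weights to zero, so the network outputs zero everywhere, while the aligned path never leaves the solution. The settling test above asks whether the same gap between the two numbers appears at billion-parameter scale.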

Figures

Figures reproduced from arXiv: 2606.23607 by Tianyi Li, Zhiqiang Shen.

Figure 1: Loss barriers and scaling behavior. Left: Interpolation loss barriers for ViT-S under naive ...
Figure 2: Schematic loss-landscape view of single and dual learned matching. Direct interpolation ...
Figure 3: Summary of interpolation and parameterization results. (a) LMC-DM across ViT-Small, ...
Original abstract

Linear mode connectivity (LMC) provides a promising foundation for understanding and merging independently trained neural networks, but existing methods typically optimize the interpolation path from only one model endpoint, limiting their scalability and effectiveness for large pretrained transformers. We propose a novel and scalable framework for enabling LMC-based model merging to billion-parameter pretrained transformers. Our method applies properly parameterized functionality-preserving weight transformations to align functionally equivalent solutions, and introduces a dual learning procedure in which both models jointly learn their corresponding transformations toward a shared linear interpolation path. This bidirectional optimization substantially reduces interpolation barriers and enables more reliable merging across large-scale architectures. Empirically, we show that our approach achieves near-zero loss barriers on WikiText for language models with medium-sized parameters, representing, to our knowledge, the first demonstration of near-barrier-free linear connectivity at this scale. In the vision domain, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path, while modern billion-parameter LLMs exhibit only small loss barriers. These results suggest that properly resolving parameter symmetries enables large pretrained Transformers to be connected and merged through simple linear paths with substantially improved interpolation performance. Code: https://github.com/VILA-Lab/Dual-Learned-Matching

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a scalable framework for achieving linear mode connectivity (LMC) in billion-parameter pretrained transformers. It applies parameterized functionality-preserving weight transformations and introduces a dual learning procedure in which both models jointly optimize their transformations toward a shared linear interpolation path. The central empirical claims are near-zero loss barriers on WikiText for medium-sized language models (first such demonstration at this scale), ViT-L maintaining above 69% ImageNet top-1 accuracy throughout the interpolation path, and only small loss barriers for modern billion-parameter LLMs.

Significance. If the reported results hold under rigorous verification, the work would represent a meaningful advance by scaling LMC and model merging to billion-parameter regimes where prior methods have not achieved near-barrier-free connectivity. The bidirectional optimization approach addresses a known limitation of one-sided path optimization, and the public code release supports reproducibility.

minor comments (2)
  1. [Abstract] The claims of 'near-zero loss barriers' and 'first demonstration at this scale' would benefit from explicit numerical values (e.g., barrier height in nats or perplexity) and a direct comparison table against prior one-sided LMC baselines in the main text.
  2. [Methods] The description of the dual optimization objective and the precise parameterization of the functionality-preserving transformations would be clearer with an explicit equation or pseudocode block early in the methods section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our manuscript, recognition of the bidirectional optimization approach, and recommendation for minor revision. We appreciate the acknowledgment that the results, if verified, would represent a meaningful advance in scaling LMC to billion-parameter regimes.

Circularity Check

0 steps flagged

No significant circularity; empirical method stands on its own

full rationale

The paper describes an optimization-based procedure (parameterized functionality-preserving transformations + bidirectional joint learning toward a shared interpolation path) whose outputs are measured empirically on held-out metrics such as loss barriers and ImageNet accuracy. No equations, uniqueness theorems, or self-citations are invoked that would make the reported connectivity results equivalent to the inputs by construction. The central claim therefore remains an independent empirical finding rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the parameterized transformations are mentioned but not detailed enough to enumerate.

pith-pipeline@v0.9.1-grok · 5745 in / 1028 out tokens · 24180 ms · 2026-06-26T08:38:15.611613+00:00 · methodology

Appendix A Limitations

Our framework relies on explicitly parameterized functionality-preserving symmetries, and therefore may not capture all sources of misalignment between independently pretrained models. The ...