Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

Tianyi Li; Zhiqiang Shen

arxiv: 2606.23607 · v1 · pith:VTR57UYCnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 08:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords linear mode connectivity · model merging · pretrained transformers · weight transformations · dual optimization · loss barriers · large language models · vision transformers

The pith

Bidirectional optimization of functionality-preserving weight transformations enables near-zero loss barriers when linearly interpolating large pretrained transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that independently trained large transformers can be aligned so that their parameters connect along a straight line in weight space without large performance drops. It does this by letting both models simultaneously learn transformations that preserve what each model computes, rather than adjusting only one side. This produces a shared linear path along which loss or accuracy stays nearly constant. For medium-sized language models the barriers drop to near zero on WikiText; billion-parameter LLMs show only small remaining barriers, and ViT-L stays above 69 percent ImageNet top-1 accuracy along the path. If correct, the result means model merging at scale can rely on simple linear averaging once parameter symmetries are resolved.

Core claim

Properly parameterized functionality-preserving weight transformations combined with a dual learning procedure, in which both endpoint models jointly optimize their transformations toward one shared linear interpolation path, substantially reduce interpolation barriers and enable reliable merging of billion-parameter pretrained transformers.

What carries the argument

The dual learning procedure that jointly optimizes bidirectional functionality-preserving weight transformations to produce a shared linear interpolation path.
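
One way to write down the dual procedure, as a sketch only. The notation is assumed for illustration and is not taken from the paper: $\theta_A$ and $\theta_B$ are the endpoint parameters, $T_\phi$ and $T_\psi$ the learned transformations, $f_\theta$ the function computed by a network with parameters $\theta$, and $\mathcal{L}$ the task loss. Sampling $\lambda$ uniformly is likewise an assumption.

```latex
\min_{\phi,\psi}\;
\mathbb{E}_{\lambda \sim \mathcal{U}[0,1]}\,
\mathcal{L}\bigl((1-\lambda)\,T_{\phi}(\theta_A) + \lambda\,T_{\psi}(\theta_B)\bigr)
\quad\text{subject to}\quad
f_{T_{\phi}(\theta_A)} = f_{\theta_A},\qquad
f_{T_{\psi}(\theta_B)} = f_{\theta_B}.
```

Under this reading, one-sided matching corresponds to fixing $T_\phi$ to the identity and optimizing $\psi$ alone.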

If this is right

Linear interpolation becomes a practical route for merging large language models once symmetries are aligned.
The same alignment procedure scales from medium-sized transformers to billion-parameter models with only small remaining barriers.
Vision transformers such as ViT-L maintain high accuracy across the entire linear path after the transformations are learned.
Model merging no longer requires searching for complex nonlinear paths when the dual procedure is applied.
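
A toy illustration of why resolving symmetries matters for linear interpolation. This is a minimal NumPy sketch on a two-layer ReLU MLP, not the paper's method: the second model is simply a hidden-unit permutation of the first, and the alignment is recovered exactly rather than learned. All function names here are hypothetical.

```python
import numpy as np

def mlp(params, x):
    """Two-layer ReLU MLP: x -> relu(x W1^T) W2^T."""
    w1, w2 = params
    return np.maximum(x @ w1.T, 0.0) @ w2.T

def permute_hidden(params, perm):
    """Reorder hidden units: rows of W1 and columns of W2 move together."""
    w1, w2 = params
    return w1[perm], w2[:, perm]

def interpolate(a, b, lam):
    """Linear interpolation of two parameter tuples."""
    return tuple((1.0 - lam) * pa + lam * pb for pa, pb in zip(a, b))

def loss_barrier(loss_fn, a, b, num_points=11):
    """Largest gap between path loss and the linear baseline of the endpoints."""
    lams = np.linspace(0.0, 1.0, num_points)
    la, lb = loss_fn(a), loss_fn(b)
    gaps = [loss_fn(interpolate(a, b, l)) - ((1 - l) * la + l * lb) for l in lams]
    return max(gaps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
model_a = (rng.normal(size=(4, 3)), rng.normal(size=(2, 4)))
perm = np.array([3, 2, 1, 0])
model_b = permute_hidden(model_a, perm)  # same function, different coordinates
targets = mlp(model_a, x)                # both endpoints fit these targets

def mse(params):
    return float(np.mean((mlp(params, x) - targets) ** 2))

naive_barrier = loss_barrier(mse, model_a, model_b)
# argsort(perm) is the inverse permutation, so this undoes the relabeling.
aligned_b = permute_hidden(model_b, np.argsort(perm))
aligned_barrier = loss_barrier(mse, model_a, aligned_b)
print(f"naive barrier:   {naive_barrier:.4f}")
print(f"aligned barrier: {aligned_barrier:.2e}")
```

Both endpoints compute the same function, yet the naive path passes through networks that do not; after alignment the path is flat. Real pretrained transformers differ by more than a relabeling, which is why the transformations have to be learned.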

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many independently trained large models may share latent functional equivalences that linear paths can capture once symmetries are removed.
The approach could be tested on other generative architectures such as diffusion models to check whether the same bidirectional alignment reduces barriers there.
If the transformations remain stable under continued training, the method might support incremental merging of models fine-tuned on successive tasks.

Load-bearing premise

Functionality-preserving weight transformations exist that can be jointly optimized from both models without lowering the performance of either original network.
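
The premise can be made concrete with a small check. The sketch below uses the positive rescaling symmetry of ReLU units as an example of a functionality-preserving transformation and contrasts it with a transformation that is not. The abstract does not say which transformations the paper parameterizes, so this choice, and the helper names, are assumptions for illustration.

```python
import numpy as np

def mlp(w1, w2, x):
    """Two-layer ReLU MLP: x -> relu(x W1^T) W2^T."""
    return np.maximum(x @ w1.T, 0.0) @ w2.T

def rescale_hidden(w1, w2, scales):
    """Scale each hidden unit by s > 0 on the way in and by 1/s on the way out."""
    return w1 * scales[:, None], w2 / scales[None, :]

def preserves_function(original, transformed, x, tol=1e-9):
    """True if both parameter sets give the same outputs on the probe inputs."""
    diff = np.abs(mlp(*original, x) - mlp(*transformed, x))
    return bool(np.max(diff) < tol)

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 5))
w1, w2 = rng.normal(size=(8, 5)), rng.normal(size=(3, 8))
scales = rng.uniform(0.5, 2.0, size=8)

valid = rescale_hidden(w1, w2, scales)
broken = (w1 * scales[:, None], w2)  # input side rescaled, output side untouched

ok_valid = preserves_function((w1, w2), valid, x)
ok_broken = preserves_function((w1, w2), broken, x)
print(ok_valid, ok_broken)
```

A check of this kind on each endpoint, before and after the learned transformation is applied, is what would confirm that neither original network was degraded.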

What would settle it

After running the bidirectional optimization on a specific billion-parameter LLM, measuring a loss barrier that remains as large as the unaligned case would show the method does not deliver the claimed reduction.
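
For reference, a definition of the loss barrier that is common in the LMC literature is given below. The abstract does not state which definition the paper uses, so treat this as an assumed convention; $\theta_A$ and $\theta_B$ denote the two (aligned or unaligned) endpoints and $\mathcal{L}$ the evaluation loss.

```latex
B(\theta_A,\theta_B) \;=\; \max_{\lambda\in[0,1]}
\Bigl[\,\mathcal{L}\bigl((1-\lambda)\,\theta_A+\lambda\,\theta_B\bigr)
-\bigl((1-\lambda)\,\mathcal{L}(\theta_A)+\lambda\,\mathcal{L}(\theta_B)\bigr)\Bigr]
```

The test described above then amounts to comparing $B$ for the aligned endpoints against $B$ for the unaligned ones on the same evaluation data.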

Figures

Figures reproduced from arXiv: 2606.23607 by Tianyi Li, Zhiqiang Shen.

**Figure 2.** Schematic loss-landscape view of single and dual learned matching. Direct interpolation …

**Figure 3.** Summary of interpolation and parameterization results. (a) LMC-DM across ViT-Small, …

Abstract

Linear mode connectivity (LMC) provides a promising foundation for understanding and merging independently trained neural networks, but existing methods typically optimize the interpolation path from only one model endpoint, limiting their scalability and effectiveness for large pretrained transformers. We propose a novel and scalable framework for enabling LMC-based model merging to billion-parameter pretrained transformers. Our method applies properly parameterized functionality-preserving weight transformations to align functionally equivalent solutions, and introduces a dual learning procedure in which both models jointly learn their corresponding transformations toward a shared linear interpolation path. This bidirectional optimization substantially reduces interpolation barriers and enables more reliable merging across large-scale architectures. Empirically, we show that our approach achieves near-zero loss barriers on WikiText for language models with medium-sized parameters, representing, to our knowledge, the first demonstration of near-barrier-free linear connectivity at this scale. In the vision domain, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path, while modern billion-parameter LLMs exhibit only small loss barriers. These results suggest that properly resolving parameter symmetries enables large pretrained Transformers to be connected and merged through simple linear paths with substantially improved interpolation performance. Code: https://github.com/VILA-Lab/Dual-Learned-Matching

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Bidirectional dual optimization gets near-zero LMC barriers on medium-scale LMs and small ones on billion-param models, but the abstract leaves the implementation and controls opaque.

The headline result is that a joint bidirectional optimization of functionality-preserving transformations produces near-zero loss barriers on WikiText for medium-sized language models and only small barriers for billion-parameter LLMs, with ViT-L staying above 69% accuracy on the interpolation path. That is the concrete advance over prior one-sided LMC work.

The method itself is a straightforward extension: parameterize the transformations, then let both models optimize their maps toward a shared linear path instead of anchoring one endpoint. The abstract presents this as the first near-barrier-free demonstration at that scale, and the numbers are the main evidence offered.

What is solid is the empirical direction. If the full paper shows the transformations really preserve functionality and the optimization does not simply overfit the interpolation objective, the scaling claim matters for anyone trying to merge large pretrained models without retraining. The GitHub link is a plus for reproducibility.

The soft spots are exactly where the reader flagged them. The abstract gives no error bars, no ablation on the bidirectional step versus one-sided, no verification that the learned maps do not degrade the individual models, and no detail on how the transformations are parameterized at billion-parameter scale. Those omissions make it impossible to judge whether the barriers are robust or sensitive to post-hoc choices. The "to our knowledge first" phrasing also needs the citation record checked in the full text.

This paper is for groups already working on model merging, permutation symmetries, and compute-efficient ensembling. A reader who cares about practical LMC at scale will get value from the numbers if the methods section holds up. It is coherent on its own terms and engages the existing LMC literature, so it deserves a serious referee even if the experiments require tightening.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a scalable framework for achieving linear mode connectivity (LMC) in billion-parameter pretrained transformers. It applies parameterized functionality-preserving weight transformations and introduces a dual learning procedure in which both models jointly optimize their transformations toward a shared linear interpolation path. The central empirical claims are near-zero loss barriers on WikiText for medium-sized language models (first such demonstration at this scale), ViT-L maintaining above 69% ImageNet top-1 accuracy throughout the interpolation path, and only small loss barriers for modern billion-parameter LLMs.

Significance. If the reported results hold under rigorous verification, the work would represent a meaningful advance by scaling LMC and model merging to billion-parameter regimes where prior methods have not achieved near-barrier-free connectivity. The bidirectional optimization approach addresses a known limitation of one-sided path optimization, and the public code release supports reproducibility.

minor comments (2)

[Abstract] The claims of 'near-zero loss barriers' and 'first demonstration at this scale' would benefit from explicit numerical values (e.g., barrier height in nats or perplexity) and, in the main text, a table comparing directly against prior one-sided LMC baselines.
[Methods] The description of the dual optimization objective and the precise parameterization of the functionality-preserving transformations would be clearer with an explicit equation or pseudocode block early in the methods section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our manuscript, recognition of the bidirectional optimization approach, and recommendation for minor revision. We appreciate the acknowledgment that the results, if verified, would represent a meaningful advance in scaling LMC to billion-parameter regimes.

Circularity Check

0 steps flagged

No significant circularity; empirical method stands on its own

The paper describes an optimization-based procedure (parameterized functionality-preserving transformations + bidirectional joint learning toward a shared interpolation path) whose outputs are measured empirically on held-out metrics such as loss barriers and ImageNet accuracy. No equations, uniqueness theorems, or self-citations are invoked that would make the reported connectivity results equivalent to the inputs by construction. The central claim therefore remains an independent empirical finding rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the parameterized transformations are mentioned but not detailed enough to enumerate.

pith-pipeline@v0.9.1-grok · 5745 in / 1028 out tokens · 24180 ms · 2026-06-26T08:38:15.611613+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 19 canonical work pages · 6 internal anchors

[1]

A survey of weight space learning: Understanding, representation, and generation. arXiv preprint arXiv:2603.10090, 2026

Xiaolong Han, Zehong Wang, Bo Zhao, Binchi Zhang, Jundong Li, Damian Borth, Rose Yu, Haggai Maron, Yanfang Ye, Lu Yin, et al. A survey of weight space learning: Understanding, representation, and generation. arXiv preprint arXiv:2603.10090, 2026

work page arXiv 2026
[2]

Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018

2018
[3]

Git re-basin: Merging models modulo permutation symmetries

Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022

work page arXiv 2022
[4]

Symmetry in neural network parameter spaces. arXiv preprint arXiv:2506.13018, 2025

Bo Zhao, Robin Walters, and Rose Yu. Symmetry in neural network parameter spaces. arXiv preprint arXiv:2506.13018, 2025

work page arXiv 2025
[5]

Linear mode connectivity and the lottery ticket hypothesis

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International conference on machine learning, pages 3259–3269. PMLR, 2020

2020
[6]

Neural network diffusion

Kai Wang, Dongwen Tang, Boya Zeng, Yida Yin, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, and Yang You. Neural network diffusion. arXiv preprint arXiv:2402.13144, 2024

work page arXiv 2024
[7]

Generative modeling of weights: Generalization or memorization? arXiv preprint arXiv:2506.07998, 2025

Boya Zeng, Yida Yin, Zhiqiu Xu, and Zhuang Liu. Generative modeling of weights: Generalization or memorization? arXiv preprint arXiv:2506.07998, 2025

work page arXiv 2025
[8]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Essentially no barriers in neural network energy landscape

Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In International conference on machine learning, pages 1309–1318. PMLR, 2018

2018
[10]

The butterfly effect: Neural network training trajectories are highly sensitive to initial conditions. arXiv preprint arXiv:2506.13234, 2025

Devin Kwok, Gül Sena Altıntaş, Colin Raffel, and David Rolnick. The butterfly effect: Neural network training trajectories are highly sensitive to initial conditions. arXiv preprint arXiv:2506.13234, 2025

work page arXiv 2025
[11]

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[12]

The role of permutation invariance in linear mode connectivity of neural networks. arXiv preprint arXiv:2110.06296, 2021

Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. arXiv preprint arXiv:2110.06296, 2021

work page arXiv 2021
[13]

On the algebraic structure of feedforward network weight spaces

Robert Hecht-Nielsen. On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers, pages 129–135. Elsevier, 1990

1990
[14]

On the geometry of feedforward neural network error surfaces. Neural computation, 5(6):910–927, 1993

An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural computation, 5(6):910–927, 1993

1993
[15]

Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020

Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020

2020
[16]

Optimizing mode connectivity via neuron alignment. Advances in Neural Information Processing Systems, 33:15300–15311, 2020

Norman Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, and Rongjie Lai. Optimizing mode connectivity via neuron alignment. Advances in Neural Information Processing Systems, 33:15300–15311, 2020

2020
[17]

Convergent learning: Do different neural networks learn the same representations? In Feature Extraction: Modern Questions and Challenges, pages 196–212

Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations? In Feature Extraction: Modern Questions and Challenges, pages 196–212. PMLR, 2015.

2015
[18]

Slicegpt: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024

work page arXiv 2024
[19]

Generalized linear mode connectivity for transformers. arXiv preprint arXiv:2506.22712, 2025

Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, and Valentina Boeva. Generalized linear mode connectivity for transformers. arXiv preprint arXiv:2506.22712, 2025

work page arXiv 2025
[20]

Beyond the permutation symmetry of transformers: The role of rotation for model fusion. arXiv preprint arXiv:2502.00264, 2025

Binchi Zhang, Zaiyi Zheng, Zhengzhang Chen, and Jundong Li. Beyond the permutation symmetry of transformers: The role of rotation for model fusion. arXiv preprint arXiv:2502.00264, 2025

work page arXiv 2025
[21]

Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

2024
[22]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[23]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023

2023
[24]

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Nai...

2024
[25]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

2009
[26]

Pointer sentinel mixture models, 2016

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

2016
[27]

Going beyond linear mode connectivity: The layerwise linear feature connectivity. Advances in neural information processing systems, 36:60853–60877, 2023

Zhanpeng Zhou, Yongyi Yang, Xiaojiang Yang, Junchi Yan, and Wei Hu. Going beyond linear mode connectivity: The layerwise linear feature connectivity. Advances in neural information processing systems, 36:60853–60877, 2023

2023
[28]

Layer-wise linear mode connectivity. arXiv preprint arXiv:2307.06966, 2023

Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, and Martin Jaggi. Layer-wise linear mode connectivity. arXiv preprint arXiv:2307.06966, 2023

work page arXiv 2023
[29]

Linear mode connectivity between multiple models modulo permutation symmetries

Akira Ito, Masanori Yamada, and Atsutoshi Kumagai. Linear mode connectivity between multiple models modulo permutation symmetries. In Forty-second International Conference on Machine Learning, 2025

2025
[30]

On linear mode connectivity of mixture-of-experts architectures. arXiv preprint arXiv:2509.11348, 2025

Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, and Tan M Nguyen. On linear mode connectivity of mixture-of-experts architectures. arXiv preprint arXiv:2509.11348, 2025

work page arXiv 2025
[31]

Editing Models with Task Arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 58(8):1–41, 2026

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 58(8):1–41, 2026.

2026
[33]

Merging text transformer models from different initializations

Neha Verma and Maha Elbayad. Merging text transformer models from different initializations. arXiv preprint arXiv:2403.00986, 2024

work page arXiv 2024
[34]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[35]

Fineweb-edu: the finest collection of educational content, 2024

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content, 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

2024
[36]

The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024.

work page internal anchor Pith review Pith/arXiv arXiv 2024

Appendix A Limitations

Our framework relies on explicitly parameterized functionality-preserving symmetries, and therefore may not capture all sources of misalignment between independently pretrained models. The ...
