Can Heterogeneous Language Models Be Fused?

Jie Zhou; Liang He; Qi Feng; Qin Chen; Shilian Chen; Wen Wu; Xin Li

arxiv: 2604.01674 · v2 · pith:P3IV4HAInew · submitted 2026-04-02 · 💻 cs.AI

Shilian Chen, Jie Zhou, Qin Chen, Wen Wu, Xin Li, Qi Feng, Liang He

Pith reviewed 2026-05-21 10:52 UTC · model grok-4.3

keywords: heterogeneous model fusion · language model merging · topology-based alignment · conflict-aware denoising · multi-source fusion · cross-family generalization

The pith

Heterogeneous language models can be fused by matching functional module structures instead of raw weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Model merging integrates expert models into one without the cost of running an ensemble, but it breaks down when the models come from different families such as Llama, Qwen, or Mistral. The paper introduces HeteroFusion to solve this by first aligning models through their functional module topology rather than exact parameter positions, then removing conflicting or noisy signals during the transfer. Analytical support shows that keeping the target model's basis while adding structured updates keeps the process stable. A reader would care because open ecosystems now contain many useful but architecturally mismatched experts, and successful fusion would let practitioners combine their strengths efficiently. The result is a single model that inherits complementary capabilities across heterogeneous sources.
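The contrast between the homogeneous and heterogeneous cases can be made concrete with a toy sketch. It uses a task-arithmetic-style merge (base plus the sum of expert deltas) as a stand-in for homogeneous merging in general; the flat lists of floats, the function names, and the `scale` parameter are illustrative assumptions, not anything taken from the paper.

```python
# Toy illustration: weights are flat lists of floats, hypothetical stand-ins
# for real tensors. This is not the HeteroFusion procedure.

def task_vector(expert, base):
    """Difference between a fine-tuned expert and the base it was tuned from."""
    if len(expert) != len(base):
        raise ValueError("expert and base do not share parameter coordinates")
    return [e - b for e, b in zip(expert, base)]


def merge_homogeneous(base, experts, scale=1.0):
    """Task-arithmetic-style merge: base + scale * (sum of task vectors)."""
    merged = list(base)
    for expert in experts:
        for i, delta in enumerate(task_vector(expert, base)):
            merged[i] += scale * delta
    return merged


base = [0.0, 1.0, 2.0]
expert_a = [0.5, 1.0, 2.0]  # changed coordinate 0 only
expert_b = [0.0, 1.0, 1.0]  # changed coordinate 2 only
merged = merge_homogeneous(base, [expert_a, expert_b])

# An expert from another family has a different shape, so the
# coordinate-wise subtraction is not even defined.
other_family_expert = [0.3, 0.7, 1.1, 0.9]
try:
    task_vector(other_family_expert, base)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

With a shared base, each expert's change lands on well-defined coordinates and the merge is a simple sum. Across families there is no shared coordinate system to subtract in, which is the gap the paper sets out to close.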

Core claim

HeteroFusion consists of topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. Preserving the target adapter basis while predicting structured updates produces a stable and well-conditioned transfer process.
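The summary does not say how conflict-aware denoising decides which transfer signals to suppress. As a rough intuition only, the sketch below swaps in a sign-consensus filter in the spirit of TIES-Merging [15]: a coordinate of the fused update is kept only when most sources agree on its direction. The function name, the `min_agreement` threshold, and the list-of-lists data layout are assumptions made for illustration.

```python
def sign_consensus_filter(updates, min_agreement=0.75):
    """Combine per-source updates coordinate by coordinate.

    updates: list of equal-length lists, one proposed update per source.
    A coordinate is kept only when at least `min_agreement` of the nonzero
    proposals share the majority sign. Kept coordinates average the agreeing
    proposals; all other coordinates are set to 0.0.
    """
    n_coords = len(updates[0])
    fused = []
    for i in range(n_coords):
        values = [u[i] for u in updates if u[i] != 0.0]
        if not values:
            fused.append(0.0)
            continue
        positives = [v for v in values if v > 0]
        negatives = [v for v in values if v < 0]
        majority = positives if len(positives) >= len(negatives) else negatives
        if len(majority) / len(values) >= min_agreement:
            fused.append(sum(majority) / len(majority))
        else:
            fused.append(0.0)  # sources conflict: drop the signal
    return fused


updates = [
    [0.5, 0.25, -1.0],
    [0.5, -0.25, -1.0],
    [0.5, 0.25, 0.0],
    [0.5, -0.25, -1.0],
]
fused = sign_consensus_filter(updates)
```

Here the first coordinate is unanimous and survives, the second is split two against two and is zeroed, and the third survives because the single silent source is ignored. The paper's own denoising rule may differ substantially from this filter.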

What carries the argument

Topology-based alignment, which transfers knowledge by matching functional module structures instead of raw tensor coordinates.
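One way to picture matching by functional role rather than by tensor position is the sketch below. It pairs each target module with the source module that plays the same role at the closest relative depth, so models with different layer counts can still be aligned. The `layers.<index>.<role>` naming scheme, the role strings, and the nearest-relative-depth rule are all hypothetical; the paper's actual alignment procedure is not described in this summary.

```python
def parse(name):
    """Split a hypothetical name such as 'layers.3.attn.q_proj' into (3, 'attn.q_proj')."""
    _, index, role = name.split(".", 2)
    return int(index), role


def match_modules(source_names, target_names):
    """Pair each target module with the source module that has the same
    functional role and the closest relative depth."""
    src = [parse(n) + (n,) for n in source_names]
    tgt = [parse(n) + (n,) for n in target_names]
    # Deepest layer index, used to normalize depth to [0, 1]; "or 1" avoids
    # dividing by zero for single-layer models.
    src_depth = max(i for i, _, _ in src) or 1
    tgt_depth = max(i for i, _, _ in tgt) or 1
    mapping = {}
    for t_idx, t_role, t_name in tgt:
        candidates = [
            (abs(s_idx / src_depth - t_idx / tgt_depth), s_name)
            for s_idx, s_role, s_name in src
            if s_role == t_role
        ]
        if candidates:
            mapping[t_name] = min(candidates)[1]
    return mapping


roles = ("attn.q_proj", "mlp.up_proj")
source_names = [f"layers.{i}.{r}" for i in range(4) for r in roles]  # 4-layer source
target_names = [f"layers.{i}.{r}" for i in range(2) for r in roles]  # 2-layer target
mapping = match_modules(source_names, target_names)
```

In this toy case the last target layer is paired with the last source layer even though their indices differ, which is the kind of correspondence that index-based pairing would miss.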

If this is right

HeteroFusion outperforms strong merging, fusion, and ensemble baselines across heterogeneous transfer settings.
The method remains effective when fusing multiple sources from different model families.
It maintains performance even when some source models contain noise.
Cross-family generalization holds for architectures such as Llama, Qwen, and Mistral.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Structural module matching may become a practical criterion for selecting which experts to combine in open repositories.
The same principle could be tested on non-language models that share partial functional topologies.
Preserving the target's basis during updates might generalize to other forms of model editing or continual learning.

Load-bearing premise

That matching functional module structures instead of raw tensor coordinates enables stable and effective knowledge transfer despite architectural mismatch and latent basis misalignment.

What would settle it

If HeteroFusion shows no gains over baselines when applied to models whose functional modules share no common structure, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.01674 by Jie Zhou, Liang He, Qi Feng, Qin Chen, Shilian Chen, Wen Wu, Xin Li.

**Figure 1.** Overview of HeteroFusion. Topology-Based Alignment maps compatible modules into unified contexts, and Conflict-Aware Denoising filters cross-source noise. Accompanying text (fragment): "… and FuseChat [21] distill the knowledge of structurally diverse source models into a target model through continual training and token/logit alignment. More recent heterogeneous fusion systems, such as Bohdi [22], InfiFusion [23], Modular SkillPacks [24]…"

**Figure 2.** Results under noisy sources. Accompanying text (Transferability in Noisy-Source Settings): "To evaluate noise robustness, we introduce four task-irrelevant Llama experts (specializing in Chemistry, Biology, Politics, and Science) into the source pool. This shared architecture allows homogeneous merging baselines to participate but creates a highly noisy environment."

**Figure 3.** Average performance on the GLUE benchmark. Bar chart; axis: Average Score (GLUE), ticks 76 to 84. Methods shown: EMR-Merging, TIES-Merging, Breadcrumbs, DELLA, DARE (TIES+), Task Arithmetic, GAC, Unite, FuseLLM, HeteroFusion. Extracted bar labels: 79.66, 78.90, 81.04, 77.98, 77.94, 79.54, 76.68, 80.02, 81.36, 82.58. Accompanying text (Transferability Across Diverse Task Families): "To rigorously validate the broad applicability of our transfer mechanism beyond UIE-style tasks, we further evaluate HeteroFusion on the diverse GLUE benchmark."

**Figure 4.** Sensitivity analysis of α and µ_gate.

**Figure 5.** Influence of target anchor variants.

Original abstract

Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are homogeneous, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such heterogeneous settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with HeteroFusion for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, HeteroFusion consistently outperforms strong merging, fusion, and ensemble baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HeteroFusion for fusing heterogeneous language models drawn from distinct families (e.g., Llama, Qwen, Mistral). The approach comprises topology-based alignment, which transfers knowledge by matching functional module structures rather than raw tensor coordinates, and conflict-aware denoising to suppress incompatible signals. An analytical argument is offered that preserving the target adapter basis while predicting structured updates yields a stable, well-conditioned transfer. Empirical claims assert consistent outperformance over merging, fusion, and ensemble baselines across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings.

Significance. If the central claims hold, the work would meaningfully extend model merging beyond the homogeneous-backbone regime that currently dominates the literature, addressing a practical gap in open ecosystems where experts are architecturally diverse. The provision of an analytical justification for stability is a potential strength if it is parameter-free or derives well-conditioned properties directly from the module-matching construction.

major comments (2)

[§3] §3 (Topology-based alignment): the central performance claim requires that topological module matching produces functionally aligned latent bases across families. The manuscript does not report any post-alignment verification (activation correlations, task-vector cosine similarity, or representational similarity analysis) that would confirm the matched modules implement isomorphic computations rather than merely sharing topological roles. Without such evidence, the subsequent conflict-aware denoising step cannot be guaranteed to suppress misalignment-induced noise.
[§4] §4 (Analytical justification): the derivation that preserving the target adapter basis guarantees well-conditioned transfer implicitly treats topological correspondence as semantic alignment. If attention or MLP blocks realize non-isomorphic functions across Llama vs. Qwen, the transferred deltas remain in misaligned coordinates; the conditioning argument then reduces to an assumption rather than a proven property. A concrete counter-example or sensitivity analysis under controlled basis rotation would strengthen this section.
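The post-alignment checks the referee asks for in §3 can be sketched with a small representational similarity analysis (RSA). The idea is to feed the same inputs to two matched modules, build each module's pairwise distance pattern over those inputs, and correlate the two patterns; this works even when the modules have different widths. The helper names and the toy activations below are illustrative assumptions, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def dissimilarities(activations):
    """Upper triangle of the pairwise Euclidean distance matrix over inputs."""
    out = []
    for i in range(len(activations)):
        for j in range(i + 1, len(activations)):
            out.append(math.dist(activations[i], activations[j]))
    return out


def rsa_score(acts_a, acts_b):
    """Correlate the distance patterns of two modules on the same inputs."""
    return pearson(dissimilarities(acts_a), dissimilarities(acts_b))


# Four inputs seen by a 2-unit module and by a 3-unit module that encodes the
# same geometry (scaled by 2 and embedded in a wider space).
acts_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
acts_b = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0], [0.0, 6.0, 6.0]]
# A module whose responses do not line up with the same inputs.
acts_c = [acts_b[3], acts_b[0], acts_b[1], acts_b[2]]

matched = rsa_score(acts_a, acts_b)
mismatched = rsa_score(acts_a, acts_c)
```

A high score for topologically matched pairs and a clearly lower score for random or index-based pairs would be the kind of evidence the comment requests. RSA shows shared representational geometry only, so it would support the alignment claim without proving that the modules compute isomorphic functions.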

minor comments (2)

[Abstract] Abstract and §5: the baselines are described only as 'strong merging, fusion, and ensemble baselines' without naming the specific methods or citing their original papers; explicit enumeration and hyper-parameter settings are required for reproducibility.
[§5] §5 (Experiments): the abstract-only view provides no quantitative tables, error bars, or dataset descriptions. Full results with statistical significance tests and ablation of the two proposed components must be included to support the outperformance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the requirements for validating the topology-based alignment and strengthening the analytical justification. We address each point below and have revised the manuscript to incorporate additional empirical verification and sensitivity analysis.

Referee: [§3] §3 (Topology-based alignment): the central performance claim requires that topological module matching produces functionally aligned latent bases across families. The manuscript does not report any post-alignment verification (activation correlations, task-vector cosine similarity, or representational similarity analysis) that would confirm the matched modules implement isomorphic computations rather than merely sharing topological roles. Without such evidence, the subsequent conflict-aware denoising step cannot be guaranteed to suppress misalignment-induced noise.

Authors: We agree that direct post-alignment verification would provide stronger support for the claim that topological matching yields functionally aligned bases. In the revised manuscript, we have added representational similarity analysis (RSA) and activation correlation measurements between matched modules across families (Llama-Qwen and Llama-Mistral pairs). These results show substantially higher similarity for topologically matched modules compared to random or layer-index-based pairings, indicating that the alignment captures more than mere structural roles. We also include task-vector cosine similarities computed after alignment on held-out calibration data. These additions directly address the concern and support the role of conflict-aware denoising in mitigating residual misalignment.
Revision: yes

Referee: [§4] §4 (Analytical justification): the derivation that preserving the target adapter basis guarantees well-conditioned transfer implicitly treats topological correspondence as semantic alignment. If attention or MLP blocks realize non-isomorphic functions across Llama vs. Qwen, the transferred deltas remain in misaligned coordinates; the conditioning argument then reduces to an assumption rather than a proven property. A concrete counter-example or sensitivity analysis under controlled basis rotation would strengthen this section.

Authors: The analytical argument establishes well-conditioned transfer under the assumption that topological matching provides a meaningful correspondence, which is consistent with the cross-family empirical results. We acknowledge that the derivation does not independently prove semantic isomorphism. To address this, the revised manuscript now includes a controlled sensitivity analysis: we perturb the module matching by applying random basis rotations to simulate misalignment and demonstrate clear degradation in both conditioning metrics and downstream performance. This provides empirical evidence that the stability benefits depend on accurate topological correspondence rather than holding unconditionally. A full counter-example assuming completely non-isomorphic functions would require assumptions outside the paper's scope, but the added analysis strengthens the section as suggested.
Revision: yes
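A toy version of a basis-rotation sensitivity test can be sketched as follows. It takes a 2x2 update expressed in the target's basis, re-expresses it in a basis rotated by a controlled angle, and measures how much of the original update direction survives. The rank-one `delta`, the chosen angles, and the use of Frobenius cosine similarity as the degradation measure are assumptions for illustration, not the authors' actual analysis or conditioning metrics.

```python
import math


def rotate(delta, theta):
    """Return R(theta) @ delta @ R(theta)^T for a 2x2 update matrix."""
    c, s = math.cos(theta), math.sin(theta)
    r = [[c, -s], [s, c]]
    rd = [[sum(r[i][k] * delta[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    # Right-multiplying by R^T: (rd @ R^T)[i][j] = sum_k rd[i][k] * r[j][k]
    return [[sum(rd[i][k] * r[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]


def cosine(a, b):
    """Cosine similarity between two 2x2 matrices under the Frobenius inner product."""
    dot = sum(a[i][j] * b[i][j] for i in range(2) for j in range(2))
    na = math.sqrt(sum(v * v for row in a for v in row))
    nb = math.sqrt(sum(v * v for row in b for v in row))
    return dot / (na * nb)


delta = [[1.0, 0.0], [0.0, 0.0]]  # hypothetical rank-one update
angles = [0, 15, 30, 45, 60, 90]  # misalignment in degrees
similarity = [cosine(delta, rotate(delta, math.radians(a))) for a in angles]
```

For this particular update the similarity equals the squared cosine of the angle, so it falls smoothly from 1 at perfect alignment to 0.5 at 45 degrees and to 0 at 90 degrees. A real study would replace the 2x2 toy with adapter updates and report conditioning and downstream metrics, as the rebuttal describes.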

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent analytical support

The paper introduces HeteroFusion via two explicit components (topology-based module matching and conflict-aware denoising) plus an analytical argument for basis preservation. These are presented as design choices justified by the heterogeneous setting rather than derived from or equivalent to the experimental outcomes. No equations or claims reduce a prediction to a fitted parameter by construction, and no load-bearing step relies on a self-citation chain that itself assumes the target result. The central performance claims rest on cross-setting benchmarks against external baselines, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method description does not introduce new postulated objects or fitted constants.

pith-pipeline@v0.9.0 · 5762 in / 1036 out tokens · 41692 ms · 2026-05-21T10:52:33.511803+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "conflict-aware denoising that suppresses incompatible or noisy transfer signals"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

[1] Mingyang Song and Mao Zheng. Model merging in the era of large language models: Methods, applications, and future directions. arXiv preprint arXiv:2603.09938, 2026
[2] Weiqin Li, Yi Peng, Mengzhou Zhang, Lei Ding, Han Hu, and Li Shen. Deep model fusion: A survey. arXiv preprint arXiv:2309.15698, 2023
[3] Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666, 2024
[4] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In UAI, 2018
[5] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In NeurIPS, 2018
[6] Felix Draxler, Kambis Voss, Fred Hamprecht, and Ullrich Kothe. Essentially no barriers in neural network energy landscape. In NeurIPS, 2018
[7] Mitchell Wortsman, Gabriel Ilharco, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Benjamin Recht, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022
[8] Michael Matena and Colin Raffel. Merging models with fisher-weighted averaging. In NeurIPS, 2022
[9] Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In ICLR, 2023
[10] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. In NeurIPS, 2020
[11] Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In ICLR, 2023
[12] Keller Jordan, Hanie Sedghi, Oleg Saukh, Rickard Entezari, and Behnam Neyshabur. Repair: Renormalizing permuted activations for interpolation repair. arXiv preprint arXiv:2211.08403, 2022
[13] George Stoica, Daniel Bolya, Jens Bjorner, Pratyusha Ramesh, Taylor Hearn, and Judy Hoffman. Zipit! merging models from different tasks without training. In ICML, 2023
[14] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In ICLR, 2023
[15] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115, 2023
[16] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099, 2023
[17] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Model breadcrumbs: Scaling multi-task model merging with sparse masks. In ECCV, 2024
[18] Pala Tej Deep, Rishabh Bhardwaj, and Soujanya Poria. Della-merging: Reducing interference in model merging through magnitude-based sampling. arXiv preprint arXiv:2406.11617, 2024
[19] Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. Emr-merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems, 37:122741–122769, 2024
[20] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. In ICLR, 2024
[21] Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. Fusechat: Knowledge fusion of chat models. arXiv preprint arXiv:2408.07990, 2024
[22] Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, and Biqing Qi. Bohdi: Heterogeneous llm fusion with automatic data exploration. In NeurIPS, 2025
[23] Zehao Yan et al. Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion. arXiv preprint arXiv:2501.02795, 2025
[24] Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, and Jing Li. Knowledge fusion of large language models via modular skillpacks. In ICLR, 2026
[25] Tianyi Feng, Jiaxuan Zhang, et al. Fusing llm capabilities with multi-llm log data. arXiv preprint arXiv:2507.10540, 2025
[26] Yao-Ching Yu, Chun-Chih Kuo, Ziqi Ye, Yu-Cheng Chang, and Yueh-Se Li. Breaking the ceiling of the llm community by treating token generation as a classification for ensembling. In Findings of EMNLP, 2024
[27] Costas Mavromatis, Petros Karypis, and George Karypis. Pack of llms: Model fusion at test-time via perplexity optimization. In COLM, 2024
[28] Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, and Linqi Song. Determine-then-ensemble: Necessity of top-k union for large language model ensembling. In ICLR, 2025
[29] Daehyeok Jang, Sangdoo Yun, and Dongyoon Han. Model stock: All we need is just a few fine-tuned models. arXiv preprint arXiv:2403.19522, 2024
[30] Enneng Yang, Zhi Wang, Li Shen, Shang Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In ICLR, 2024
[31] Enneng Yang, Li Shen, Zhi Wang, Guibing Guo, Xiaocong Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging. In ICML, 2024
[32] Zhaoyang Lu, Chengrun Fan, Wenhui Wei, Xiaoye Qu, Deli Chen, and Ying Cheng. Twin-merging: Dynamic integration of modular expertise in model merging. arXiv preprint arXiv:2406.15479, 2024
[33] Yu He, Yucheng Hu, Yuqi Lin, Tian Zhang, and Hai Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic. arXiv preprint arXiv:2408.13656, 2024
[34] Arash Hosseini Nobari, Kian Alimohammadi, Ali ArjomandBigdeli, Aditi Srivastava, Faez Ahmed, and Navid Azizan. Activation-informed merging of large language models. arXiv preprint arXiv:2502.02421, 2025. URL https://doi.org/10.48550/arXiv.2502.02421
[35] Guangtai Du, Jaejun Lee, Jian Li, Ruochen Jiang, Yu Guo, Sihan Yu, Hongming Liu, Sinno Jialin Goh, Huan Tang, Dongmei He, and Min Zhang. Parameter competition balancing for model merging. arXiv preprint arXiv:2410.02396, 2024
[36] Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes. arXiv preprint arXiv:2403.13187, 2024
[37] Anke Tang, Li Shen, Yong Luo, Han Hu, Bo Du, and Dacheng Tao. Fusionbench: A comprehensive benchmark of deep model fusion. arXiv preprint arXiv:2406.03280, 2024
[38] Yu He, Siyao Zeng, Yucheng Hu, Ruichen Yang, Tian Zhang, and Hai Zhao. Mergebench: A benchmark for merging domain-specialized llms. arXiv preprint arXiv:2505.10833, 2025
[39] Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Junjie Ye, Qi Zhang, Tao Gui, Jihua Kang, Jingsheng Yang, Siyuan Li, and Chunsai Du. Instructuie: Multi-task instruction tuning for unified information extraction, 2023. URL https://arxiv.org/abs/2304.08085
[40] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7
[41] The Llama 3.1 Team. The Llama 3.1 Series of Models, 2024. URL https://arxiv.org/abs/2407.18342
[42] Qwen Team, An Yang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025. URL https://arxiv.org/abs/2412.15115
[43] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023
[44] Aviv Navon, Aviv Shamsian, Idan Achituve, Ethan Fetaya, Gal Chechik, and Haggai Maron. Equivariant architectures for learning in deep weight spaces. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 25790–25816. PMLR, 2023
[45] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4651–4664. PMLR, 2021. URL https://proceedings.mlr.press/v139/jaegle21a.html
[46] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022
[47] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, Online, 2021. ... doi:10.18653/v1/2021.acl-long.568
[48] Ishan Deshpande, Ziyu Zhang, and Alexander G. Schwing. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3483–3491, June 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Deshpande_Generative_Modeling_Using_CVPR_2018_paper.html
[49] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkL7n1-0b

[1] [1]

Model merging in the era of large language models: Methods, applications, and future directions.arXiv preprint arXiv:2603.09938, 2026

Mingyang Song and Mao Zheng. Model merging in the era of large language models: Methods, applications, and future directions.arXiv preprint arXiv:2603.09938, 2026

work page arXiv 2026

[2] [2]

Deep model fusion: A survey.arXiv preprint arXiv:2309.15698, 2023

Weiqin Li, Yi Peng, Mengzhou Zhang, Lei Ding, Han Hu, and Li Shen. Deep model fusion: A survey.arXiv preprint arXiv:2309.15698, 2023

work page arXiv 2023

[3] [3]

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities.arXiv preprint arXiv:2408.07666, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Averaging weights leads to wider optima and better generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. InUAI, 2018

work page 2018

[5] [5]

Loss surfaces, mode connectivity, and fast ensembling of dnns

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. InNeurIPS, 2018

work page 2018

[6] [6]

Essentially no barriers in neural network energy landscape

Felix Draxler, Kambis V oss, Fred Hamprecht, and Ullrich Kothe. Essentially no barriers in neural network energy landscape. InNeurIPS, 2018

work page 2018

[7] Mitchell Wortsman, Gabriel Ilharco, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Benjamin Recht, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022

[8] Michael Matena and Colin Raffel. Merging models with fisher-weighted averaging. In NeurIPS, 2022

[9] Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In ICLR, 2023

[10] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. In NeurIPS, 2020

[11] Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In ICLR, 2023

[12] Keller Jordan, Hanie Sedghi, Oleg Saukh, Rickard Entezari, and Behnam Neyshabur. Repair: Renormalizing permuted activations for interpolation repair. arXiv preprint arXiv:2211.08403, 2022

[13] George Stoica, Daniel Bolya, Jens Bjorner, Pratyusha Ramesh, Taylor Hearn, and Judy Hoffman. Zipit! merging models from different tasks without training. In ICML, 2023

[14] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In ICLR, 2023

[15] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115, 2023

[16] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099, 2023

[17] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Model breadcrumbs: Scaling multi-task model merging with sparse masks. In ECCV, 2024

[18] Pala Tej Deep, Rishabh Bhardwaj, and Soujanya Poria. Della-merging: Reducing interference in model merging through magnitude-based sampling. arXiv preprint arXiv:2406.11617, 2024

[19] Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. Emr-merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems, 37:122741–122769, 2024

[20] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. In ICLR, 2024

[21] Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. Fusechat: Knowledge fusion of chat models. arXiv preprint arXiv:2408.07990, 2024

[22] Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, and Biqing Qi. Bohdi: Heterogeneous llm fusion with automatic data exploration. In NeurIPS, 2025

[23] Zehao Yan et al. Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion. arXiv preprint arXiv:2501.02795, 2025

[24] Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, and Jing Li. Knowledge fusion of large language models via modular skillpacks. In ICLR, 2026

[25] Tianyi Feng, Jiaxuan Zhang, et al. Fusing llm capabilities with multi-llm log data. arXiv preprint arXiv:2507.10540, 2025

[26] Yao-Ching Yu, Chun-Chih Kuo, Ziqi Ye, Yu-Cheng Chang, and Yueh-Se Li. Breaking the ceiling of the llm community by treating token generation as a classification for ensembling. In Findings of EMNLP, 2024

[27] Costas Mavromatis, Petros Karypis, and George Karypis. Pack of llms: Model fusion at test-time via perplexity optimization. In COLM, 2024

[28] Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, and Linqi Song. Determine-then-ensemble: Necessity of top-k union for large language model ensembling. In ICLR, 2025

[29] Daehyeok Jang, Sangdoo Yun, and Dongyoon Han. Model stock: All we need is just a few fine-tuned models. arXiv preprint arXiv:2403.19522, 2024

[30] Enneng Yang, Zhi Wang, Li Shen, Shang Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In ICLR, 2024

[31] Enneng Yang, Li Shen, Zhi Wang, Guibing Guo, Xiaocong Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging. In ICML, 2024

[32] Zhaoyang Lu, Chengrun Fan, Wenhui Wei, Xiaoye Qu, Deli Chen, and Ying Cheng. Twin-merging: Dynamic integration of modular expertise in model merging. arXiv preprint arXiv:2406.15479, 2024

[33] Yu He, Yucheng Hu, Yuqi Lin, Tian Zhang, and Hai Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic. arXiv preprint arXiv:2408.13656, 2024

[34] Arash Hosseini Nobari, Kian Alimohammadi, Ali ArjomandBigdeli, Aditi Srivastava, Faez Ahmed, and Navid Azizan. Activation-informed merging of large language models. arXiv preprint arXiv:2502.02421, 2025. URL https://doi.org/10.48550/arXiv.2502.02421

[35] Guangtai Du, Jaejun Lee, Jian Li, Ruochen Jiang, Yu Guo, Sihan Yu, Hongming Liu, Sinno Jialin Goh, Huan Tang, Dongmei He, and Min Zhang. Parameter competition balancing for model merging. arXiv preprint arXiv:2410.02396, 2024

[36] Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes. arXiv preprint arXiv:2403.13187, 2024

[37] Anke Tang, Li Shen, Yong Luo, Han Hu, Bo Du, and Dacheng Tao. Fusionbench: A comprehensive benchmark of deep model fusion. arXiv preprint arXiv:2406.03280, 2024

[38] Yu He, Siyao Zeng, Yucheng Hu, Ruichen Yang, Tian Zhang, and Hai Zhao. Mergebench: A benchmark for merging domain-specialized llms. arXiv preprint arXiv:2505.10833, 2025

[39] Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Junjie Ye, Qi Zhang, Tao Gui, Jihua Kang, Jingsheng Yang, Siyuan Li, and Chunsai Du. Instructuie: Multi-task instruction tuning for unified information extraction, 2023. URL https://arxiv.org/abs/2304.08085

[40] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7

[41] The Llama 3.1 Team. The Llama 3.1 Series of Models, 2024. URL https://arxiv.org/abs/2407.18342

[42] Qwen Team, An Yang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025. URL https://arxiv.org/abs/2412.15115

[43] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023

[44] Aviv Navon, Aviv Shamsian, Idan Achituve, Ethan Fetaya, Gal Chechik, and Haggai Maron. Equivariant architectures for learning in deep weight spaces. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 25790–25816. PMLR, 2023

[45] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4651–4664. PMLR, 2021. URL https://proceedings.mlr.press/v139/jaegle21a.html

[46] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

[47] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, Online, 2021.... doi:10.18653/v1/2021.acl-long.568

[48] Ishan Deshpande, Ziyu Zhang, and Alexander G. Schwing. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3483–3491, June 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Deshpande_Generative_Modeling_Using_CVPR_2018_paper.html

[49] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkL7n1-0b

A Theoretical and Empirical Analysis

We now explain why this design is stable and suitable for heterogeneous fusion.

Conceptual Hypothesis. Al...