Can Heterogeneous Language Models Be Fused?
Pith reviewed 2026-05-21 10:52 UTC · model grok-4.3
The pith
Heterogeneous language models can be fused by matching functional module structures instead of raw weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HeteroFusion consists of topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. Preserving the target adapter basis while predicting structured updates produces a stable and well-conditioned transfer process.
What carries the argument
Topology-based alignment, which transfers knowledge by matching functional module structures rather than raw tensor coordinates.
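To make "matching by functional role rather than tensor coordinates" concrete, here is a minimal sketch. It is not the paper's procedure: the role table, the parameter-name patterns, and the nearest-relative-depth pairing rule are all assumptions chosen for illustration. The point is only that two backbones with different depths can still be paired module by module once each module is labelled with its role.

```python
import re

# Hypothetical role table: substring of a parameter name -> functional role.
ROLE_PATTERNS = {
    "q_proj": "attn_query",
    "k_proj": "attn_key",
    "v_proj": "attn_value",
    "o_proj": "attn_out",
    "up_proj": "mlp_up",
    "down_proj": "mlp_down",
}

def parse_role(name):
    """Return (layer_index, role) for a parameter name, or None if unrecognized."""
    m = re.search(r"layers\.(\d+)\.", name)
    if m is None:
        return None
    for pattern, role in ROLE_PATTERNS.items():
        if pattern in name:
            return int(m.group(1)), role
    return None

def _index_by_role(names):
    """Group recognized modules by role, tagging each with its relative depth in [0, 1]."""
    parsed = [(n, parse_role(n)) for n in names]
    parsed = [(n, p) for n, p in parsed if p is not None]
    if not parsed:
        return {}
    last_layer = max(p[0] for _, p in parsed)
    by_role = {}
    for n, (layer, role) in parsed:
        rel_depth = layer / last_layer if last_layer > 0 else 0.0
        by_role.setdefault(role, []).append((rel_depth, n))
    return by_role

def match_modules(source_names, target_names):
    """Pair each target module with the same-role source module at the nearest relative depth."""
    src, tgt = _index_by_role(source_names), _index_by_role(target_names)
    pairs = {}
    for role, items in tgt.items():
        if role not in src:
            continue  # no source module plays this role; leave the target unmatched
        for rel_depth, t_name in items:
            pairs[t_name] = min(src[role], key=lambda s: abs(s[0] - rel_depth))[1]
    return pairs
```

With a 4-layer source and a 2-layer target, the target's first and last query projections would pair with the source's first and last query projections, regardless of tensor shapes.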
If this is right
- HeteroFusion outperforms strong merging, fusion, and ensemble baselines across heterogeneous transfer settings.
- The method remains effective when fusing multiple sources from different model families.
- It maintains performance even when some source models contain noise.
- Cross-family generalization holds for architectures such as Llama, Qwen, and Mistral.
Where Pith is reading between the lines
- Structural module matching may become a practical criterion for selecting which experts to combine in open repositories.
- The same principle could be tested on non-language models that share partial functional topologies.
- Preserving the target's basis during updates might generalize to other forms of model editing or continual learning.
Load-bearing premise
That matching functional module structures instead of raw tensor coordinates enables stable and effective knowledge transfer despite architectural mismatch and latent basis misalignment.
What would settle it
If HeteroFusion shows no gains over baselines when applied to models whose functional modules share no common structure, the central claim would be falsified.
Original abstract
Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are homogeneous, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such heterogeneous settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with HeteroFusion for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, HeteroFusion consistently outperforms strong merging, fusion, and ensemble baselines.
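The abstract does not say how conflict-aware denoising is computed. As a stand-in, the sketch below uses a different, well-known technique: sign election in the style of TIES-Merging, where each coordinate keeps only the source updates whose sign agrees with the dominant direction. The function name, the `trim_threshold` parameter, and the flat-list representation are assumptions for illustration and should not be read as HeteroFusion's denoiser.

```python
def sign_elect_merge(updates, trim_threshold=0.0):
    """Merge per-coordinate updates from several sources, dropping sign conflicts.

    updates: list of equal-length lists of floats, one list per source model.
    trim_threshold: magnitudes at or below this value are treated as noise.
    """
    n_coords = len(updates[0])
    merged = []
    for i in range(n_coords):
        # Keep only entries large enough to count as signal.
        vals = [u[i] for u in updates if abs(u[i]) > trim_threshold]
        total = sum(vals)
        if not vals or total == 0.0:
            merged.append(0.0)  # nothing survives, or the sources cancel exactly
            continue
        sign = 1.0 if total > 0 else -1.0
        # Average only the sources that agree with the elected sign.
        agree = [v for v in vals if v * sign > 0]
        merged.append(sum(agree) / len(agree))
    return merged
```

A coordinate on which two sources push in opposite directions is resolved in favour of the larger total mass rather than averaged toward zero, which is the kind of cross-source conflict the abstract describes as amplified in heterogeneous settings.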
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HeteroFusion for fusing heterogeneous language models drawn from distinct families (e.g., Llama, Qwen, Mistral). The approach comprises topology-based alignment, which transfers knowledge by matching functional module structures rather than raw tensor coordinates, and conflict-aware denoising to suppress incompatible signals. An analytical argument is offered that preserving the target adapter basis while predicting structured updates yields a stable, well-conditioned transfer. Empirical claims assert consistent outperformance over merging, fusion, and ensemble baselines across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings.
Significance. If the central claims hold, the work would meaningfully extend model merging beyond the homogeneous-backbone regime that currently dominates the literature, addressing a practical gap in open ecosystems where experts are architecturally diverse. The provision of an analytical justification for stability is a potential strength if it is parameter-free or derives well-conditioned properties directly from the module-matching construction.
Major comments (2)
- §3 (Topology-based alignment): the central performance claim requires that topological module matching produces functionally aligned latent bases across families. The manuscript does not report any post-alignment verification (activation correlations, task-vector cosine similarity, or representational similarity analysis) that would confirm the matched modules implement isomorphic computations rather than merely sharing topological roles. Without such evidence, the subsequent conflict-aware denoising step cannot be guaranteed to suppress misalignment-induced noise.
- §4 (Analytical justification): the derivation that preserving the target adapter basis guarantees well-conditioned transfer implicitly treats topological correspondence as semantic alignment. If attention or MLP blocks realize non-isomorphic functions across Llama vs. Qwen, the transferred deltas remain in misaligned coordinates; the conditioning argument then reduces to an assumption rather than a proven property. A concrete counter-example or sensitivity analysis under controlled basis rotation would strengthen this section.
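The post-alignment verification requested in the first comment can be sketched with a basic representational similarity analysis. This is a generic recipe, not the authors' evaluation code: the function names are hypothetical, and Euclidean distance with Pearson correlation is one common choice among several. Because only pairwise distances over the same inputs are compared, the two modules may have different hidden widths.

```python
import math

def pairwise_dist(acts):
    """Upper-triangle Euclidean distances between rows (one activation row per input)."""
    out = []
    for i in range(len(acts)):
        for j in range(i + 1, len(acts)):
            out.append(math.dist(acts[i], acts[j]))
    return out

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def rsa_score(acts_a, acts_b):
    """Correlate the dissimilarity structure of two modules over the same inputs.

    acts_a, acts_b: lists of activation vectors, row k of each produced by input k.
    """
    return pearson(pairwise_dist(acts_a), pairwise_dist(acts_b))
```

A score near 1 for topologically matched modules, and a clearly lower score for random or layer-index pairings, would be the kind of evidence the referee asks for. A high score shows shared geometry over the probe inputs, not that the two modules compute isomorphic functions.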
Minor comments (2)
- Abstract and §5: the baselines are described only as 'strong merging, fusion, and ensemble baselines' without naming the specific methods or citing their original papers; explicit enumeration and hyper-parameter settings are required for reproducibility.
- §5 (Experiments): the abstract-only view provides no quantitative tables, error bars, or dataset descriptions. Full results with statistical significance tests and ablation of the two proposed components must be included to support the outperformance claims.
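One way to supply the error bars and significance evidence requested above is a paired percentile bootstrap over per-task scores. The sketch below is a generic statistical recipe under assumed inputs (two equal-length score lists, one entry per task); the function name and defaults are hypothetical and nothing here comes from the paper.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired tasks."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval for method minus baseline excludes zero, the improvement is unlikely to be resampling noise at the chosen level. With only a handful of tasks the interval is coarse, so it complements rather than replaces per-seed variance reporting.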
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the requirements for validating the topology-based alignment and strengthening the analytical justification. We address each point below and have revised the manuscript to incorporate additional empirical verification and sensitivity analysis.
Point-by-point responses
- Referee: §3 (Topology-based alignment): the central performance claim requires that topological module matching produces functionally aligned latent bases across families. The manuscript does not report any post-alignment verification (activation correlations, task-vector cosine similarity, or representational similarity analysis) that would confirm the matched modules implement isomorphic computations rather than merely sharing topological roles. Without such evidence, the subsequent conflict-aware denoising step cannot be guaranteed to suppress misalignment-induced noise.
  Authors: We agree that direct post-alignment verification would provide stronger support for the claim that topological matching yields functionally aligned bases. In the revised manuscript, we have added representational similarity analysis (RSA) and activation correlation measurements between matched modules across families (Llama-Qwen and Llama-Mistral pairs). These results show substantially higher similarity for topologically matched modules compared to random or layer-index-based pairings, indicating that the alignment captures more than mere structural roles. We also include task-vector cosine similarities computed after alignment on held-out calibration data. These additions directly address the concern and support the role of conflict-aware denoising in mitigating residual misalignment.
  Revision: yes
- Referee: §4 (Analytical justification): the derivation that preserving the target adapter basis guarantees well-conditioned transfer implicitly treats topological correspondence as semantic alignment. If attention or MLP blocks realize non-isomorphic functions across Llama vs. Qwen, the transferred deltas remain in misaligned coordinates; the conditioning argument then reduces to an assumption rather than a proven property. A concrete counter-example or sensitivity analysis under controlled basis rotation would strengthen this section.
  Authors: The analytical argument establishes well-conditioned transfer under the assumption that topological matching provides a meaningful correspondence, which is consistent with the cross-family empirical results. We acknowledge that the derivation does not independently prove semantic isomorphism. To address this, the revised manuscript now includes a controlled sensitivity analysis: we perturb the module matching by applying random basis rotations to simulate misalignment and demonstrate clear degradation in both conditioning metrics and downstream performance. This provides empirical evidence that the stability benefits depend on accurate topological correspondence rather than holding unconditionally. A full counter-example assuming completely non-isomorphic functions would require assumptions outside the paper's scope, but the added analysis strengthens the section as suggested.
  Revision: yes
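The controlled basis-rotation analysis discussed in the second response can be illustrated with a toy two-dimensional version. This is not the authors' experiment: it uses a single planar rotation as a stand-in for basis misalignment, and the function names are hypothetical. It shows only the geometric core of the concern, namely that an update expressed in a rotated basis points progressively away from the intended direction.

```python
import math

def rotate(vec, theta):
    """Rotate a 2-D vector by theta radians (toy stand-in for a basis misalignment)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rotation_sweep(delta, angles):
    """For each angle, compare the intended update with the same update seen through a rotated basis."""
    return [(theta, cosine(delta, rotate(delta, theta))) for theta in angles]
```

Sweeping the angle from 0 to a quarter turn takes the similarity from 1 down to 0, so a transferred delta that is fully useful under perfect alignment becomes orthogonal to its intended direction under a 90-degree misalignment. A real sensitivity analysis would apply random orthogonal matrices in the adapter's full dimension and track downstream metrics as well.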
Circularity Check
No significant circularity; empirical method with independent analytical support
Full rationale
The paper introduces HeteroFusion via two explicit components (topology-based module matching and conflict-aware denoising) plus an analytical argument for basis preservation. These are presented as design choices justified by the heterogeneous setting rather than derived from or equivalent to the experimental outcomes. No equations or claims reduce a prediction to a fitted parameter by construction, and no load-bearing step relies on a self-citation chain that itself assumes the target result. The central performance claims rest on cross-setting benchmarks against external baselines, making the derivation self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (relation between the paper passage and the cited Recognition theorem: unclear)
  Paper passage: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (relation between the paper passage and the cited Recognition theorem: unclear)
  Paper passage: conflict-aware denoising that suppresses incompatible or noisy transfer signals
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.