arxiv: 2604.01674 · v2 · pith:P3IV4HAInew · submitted 2026-04-02 · 💻 cs.AI

Can Heterogeneous Language Models Be Fused?

Pith reviewed 2026-05-21 10:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords heterogeneous model fusion · language model merging · topology-based alignment · conflict-aware denoising · multi-source fusion · cross-family generalization

The pith

Heterogeneous language models can be fused by matching functional module structures instead of raw weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Model merging integrates expert models into one without the cost of running an ensemble, but it breaks down when the models come from different families such as Llama, Qwen, or Mistral. The paper introduces HeteroFusion to solve this by first aligning models through their functional module topology rather than exact parameter positions, then removing conflicting or noisy signals during the transfer. Analytical support shows that keeping the target model's basis while adding structured updates keeps the process stable. A reader would care because open ecosystems now contain many useful but architecturally mismatched experts, and successful fusion would let practitioners combine their strengths efficiently. The result is a single model that inherits complementary capabilities across heterogeneous sources.
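To make the homogeneous assumption concrete, the following is a minimal sketch of task-vector merging on toy NumPy dictionaries, in the spirit of task arithmetic [14]. The parameter names (`layer0.q_proj`, `blocks.0.attn.wq`), the shapes, and all helper functions are hypothetical illustrations, not code from the paper. It shows that direct weight-space merging is only defined when every model shares parameter names and shapes, which is the condition that fails across families.

```python
import numpy as np

def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {name: finetuned[name] - base[name] for name in base}

def merge_homogeneous(base, task_vectors, scale=1.0):
    """Add the summed task vectors back onto the shared base weights."""
    merged = {}
    for name, weight in base.items():
        merged[name] = weight + scale * sum(tv[name] for tv in task_vectors)
    return merged

def can_merge_directly(model_a, model_b):
    """True only if both models expose identical parameter names and shapes."""
    if model_a.keys() != model_b.keys():
        return False
    return all(model_a[k].shape == model_b[k].shape for k in model_a)

rng = np.random.default_rng(0)

# Two toy experts fine-tuned from the same base (hypothetical names and shapes).
base = {"layer0.q_proj": rng.normal(size=(8, 8))}
expert_1 = {"layer0.q_proj": base["layer0.q_proj"] + 0.10}
expert_2 = {"layer0.q_proj": base["layer0.q_proj"] - 0.05}
merged = merge_homogeneous(
    base, [task_vector(expert_1, base), task_vector(expert_2, base)]
)

# A toy model from a different "family": different naming and width,
# so coordinate-wise addition is not even defined.
other_family = {"blocks.0.attn.wq": rng.normal(size=(12, 12))}
same_family_ok = can_merge_directly(expert_1, expert_2)
cross_family_ok = can_merge_directly(base, other_family)
```

Here `same_family_ok` is true and `cross_family_ok` is false, which is the gap the paper targets.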

Core claim

HeteroFusion consists of topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. Preserving the target adapter basis while predicting structured updates produces a stable and well-conditioned transfer process.

What carries the argument

topology-based alignment that transfers knowledge by matching functional module structures instead of raw tensor coordinates
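As a rough illustration of matching by functional role rather than by tensor coordinate, the sketch below pairs modules that play the same role (for example, the attention query projection) at a similar relative depth. Everything in it is an assumption made for illustration: the name patterns in `ROLE_PATTERNS`, the relative-depth heuristic, and the function names are hypothetical, and the paper's actual alignment procedure may differ substantially.

```python
import re

# Hypothetical mapping from parameter-name suffixes to functional roles.
ROLE_PATTERNS = {
    "attn_query": r"(q_proj|wq)$",
    "attn_key": r"(k_proj|wk)$",
    "attn_value": r"(v_proj|wv)$",
    "attn_output": r"(o_proj|wo)$",
    "mlp_up": r"(up_proj|w_up)$",
    "mlp_down": r"(down_proj|w_down)$",
}

def describe(name, num_layers):
    """Return (role, relative_depth in [0, 1]) for a module name, or None."""
    layer = re.search(r"\.(\d+)\.", name)
    if layer is None:
        return None
    for role, pattern in ROLE_PATTERNS.items():
        if re.search(pattern, name):
            depth = int(layer.group(1)) / max(num_layers - 1, 1)
            return role, depth
    return None

def match_modules(source_names, source_layers, target_names, target_layers):
    """Pair each target module with the same-role source module nearest in depth."""
    described = [(n, describe(n, source_layers)) for n in source_names]
    described = [(n, d) for n, d in described if d is not None]
    matches = {}
    for target in target_names:
        info = describe(target, target_layers)
        if info is None:
            continue
        role, depth = info
        candidates = [(abs(d[1] - depth), n) for n, d in described if d[0] == role]
        if candidates:
            matches[target] = min(candidates)[1]
    return matches

# Toy example: a 4-layer source and a 2-layer target with different naming.
source = [f"blocks.{i}.attn.wq" for i in range(4)]
target = [f"model.layers.{j}.self_attn.q_proj" for j in range(2)]
pairs = match_modules(source, 4, target, 2)
```

In this toy case the first target layer is paired with the first source block and the last target layer with the last source block, even though the two models differ in depth and naming.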

If this is right

  • HeteroFusion outperforms strong merging, fusion, and ensemble baselines across heterogeneous transfer settings.
  • The method remains effective when fusing multiple sources from different model families.
  • It maintains performance even when some source models contain noise.
  • Cross-family generalization holds for architectures such as Llama, Qwen, and Mistral.
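The noise-robustness claim depends on suppressing conflicting transfer signals. The sketch below shows one generic way to do that, assuming the source updates have already been aligned to a common shape: trim small entries, elect a sign per coordinate, and average only the agreeing entries. This is a TIES-style procedure [15] used here as a stand-in; it is not the paper's conflict-aware denoising module, and `denoise_and_fuse` and `keep_fraction` are hypothetical names.

```python
import numpy as np

def denoise_and_fuse(updates, keep_fraction=0.5):
    """Fuse aligned updates, dropping small entries and sign-conflicting entries.

    Illustrative stand-in only (TIES-style trim, elect, merge).
    """
    stacked = np.stack(updates)  # shape: (num_sources, ...)

    # 1) Trim: keep only the largest-magnitude entries of each source.
    trimmed = np.zeros_like(stacked)
    for i, update in enumerate(stacked):
        threshold = np.quantile(np.abs(update), 1.0 - keep_fraction)
        trimmed[i] = np.where(np.abs(update) >= threshold, update, 0.0)

    # 2) Elect: per coordinate, pick the sign carrying more total mass.
    elected = np.sign(trimmed.sum(axis=0))

    # 3) Merge: average only the nonzero entries that agree with that sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (trimmed * agree).sum(axis=0) / counts

source_a = np.array([1.0, -2.0, 0.1, 3.0])
source_b = np.array([1.5, 2.5, -0.1, -1.0])
fused = denoise_and_fuse([source_a, source_b], keep_fraction=0.5)
```

In the toy example, the second coordinate has a sign conflict (`-2.0` versus `2.5`), and only the entry matching the elected sign survives.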

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Structural module matching may become a practical criterion for selecting which experts to combine in open repositories.
  • The same principle could be tested on non-language models that share partial functional topologies.
  • Preserving the target's basis during updates might generalize to other forms of model editing or continual learning.

Load-bearing premise

That matching functional module structures instead of raw tensor coordinates enables stable and effective knowledge transfer despite architectural mismatch and latent basis misalignment.

What would settle it

If HeteroFusion shows no gains over baselines when applied to models whose functional modules share no common structure, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.01674 by Jie Zhou, Liang He, Qi Feng, Qin Chen, Shilian Chen, Wen Wu, Xin Li.

Figure 1
Overview of HeteroFusion. Topology-Based Alignment maps compatible modules into unified contexts, and Conflict-Aware Denoising filters cross-source noise.
Excerpt from the surrounding text: "… and FuseChat [21] distill the knowledge of structurally diverse source models into a target model through continual training and token/logit alignment. More recent heterogeneous fusion systems, such as Bohdi [22], InfiFusion [23], Modular SkillPacks [24]…"
Figure 2
Results under noisy sources.
Excerpt from the surrounding text: "Transferability in Noisy-Source Settings. To evaluate noise robustness, we introduce four task-irrelevant Llama experts (specializing in Chemistry, Biology, Politics, and Science) into the source pool. This shared architecture allows homogeneous merging baselines to participate but creates a highly noisy environment. As illustrated in …"
Figure 4
Sensitivity analysis of α and μ_gate.
Extracted chart text (bar chart of GLUE averages): methods, in extraction order: EMR-Merging, TIES-Merging, Breadcrumbs, DELLA, DARE (TIES+), Task Arithmetic, GAC, Unite, FuseLLM, HeteroFusion. Axis: Average Score (GLUE), ticks 76 to 84. Bar labels, in extraction order: 79.66, 78.90, 81.04, 77.98, 77.94, 79.54, 76.68, 80.02, 81.36, 82.58.
Figure 3
Average performance on the GLUE benchmark.
Excerpt from the surrounding text: "Transferability Across Diverse Task Families. To rigorously validate the broad applicability of our transfer mechanism beyond UIE-style tasks, we further evaluate HeteroFusion on the diverse GLUE benchmark. As illustrated in …"
Figure 5
Influence of target anchor variants.
Original abstract

Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are homogeneous, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such heterogeneous settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with HeteroFusion for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, HeteroFusion consistently outperforms strong merging, fusion, and ensemble baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HeteroFusion for fusing heterogeneous language models drawn from distinct families (e.g., Llama, Qwen, Mistral). The approach comprises topology-based alignment, which transfers knowledge by matching functional module structures rather than raw tensor coordinates, and conflict-aware denoising to suppress incompatible signals. An analytical argument is offered that preserving the target adapter basis while predicting structured updates yields a stable, well-conditioned transfer. Empirical claims assert consistent outperformance over merging, fusion, and ensemble baselines across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings.

Significance. If the central claims hold, the work would meaningfully extend model merging beyond the homogeneous-backbone regime that currently dominates the literature, addressing a practical gap in open ecosystems where experts are architecturally diverse. The provision of an analytical justification for stability is a potential strength if it is parameter-free or derives well-conditioned properties directly from the module-matching construction.

major comments (2)
  1. [§3] §3 (Topology-based alignment): the central performance claim requires that topological module matching produces functionally aligned latent bases across families. The manuscript does not report any post-alignment verification (activation correlations, task-vector cosine similarity, or representational similarity analysis) that would confirm the matched modules implement isomorphic computations rather than merely sharing topological roles. Without such evidence, the subsequent conflict-aware denoising step cannot be guaranteed to suppress misalignment-induced noise.
  2. [§4] §4 (Analytical justification): the derivation that preserving the target adapter basis guarantees well-conditioned transfer implicitly treats topological correspondence as semantic alignment. If attention or MLP blocks realize non-isomorphic functions across Llama vs. Qwen, the transferred deltas remain in misaligned coordinates; the conditioning argument then reduces to an assumption rather than a proven property. A concrete counter-example or sensitivity analysis under controlled basis rotation would strengthen this section.
minor comments (2)
  1. [Abstract] Abstract and §5: the baselines are described only as 'strong merging, fusion, and ensemble baselines' without naming the specific methods or citing their original papers; explicit enumeration and hyper-parameter settings are required for reproducibility.
  2. [§5] §5 (Experiments): the abstract-only view provides no quantitative tables, error bars, or dataset descriptions. Full results with statistical significance tests and ablation of the two proposed components must be included to support the outperformance claims.
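The post-alignment verification requested in the first major comment could be approximated with a standard representational similarity index. The sketch below implements linear centered kernel alignment (CKA) between two activation matrices collected on the same inputs; CKA is chosen here as one common option alongside the measures the report lists, and its use is an assumption rather than something the manuscript reports. The activations are synthetic and the variable names are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activations x (n, d1) and y (n, d2) on the same n inputs."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)

# Synthetic "target module" activations on 200 probe inputs.
acts_target = rng.normal(size=(200, 16))

# A "matched" module that computes the same function in a rotated, rescaled basis.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
acts_matched = 3.0 * (acts_target @ rotation)

# An unrelated module with independent activations and a different width.
acts_unrelated = rng.normal(size=(200, 24))

cka_matched = linear_cka(acts_target, acts_matched)
cka_unrelated = linear_cka(acts_target, acts_unrelated)
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling, `cka_matched` is close to 1 while `cka_unrelated` is much lower. A real check would compare topologically matched modules against random or layer-index pairings on held-out data.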

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the requirements for validating the topology-based alignment and strengthening the analytical justification. We address each point below and have revised the manuscript to incorporate additional empirical verification and sensitivity analysis.

Point-by-point responses
  1. Referee: [§3] §3 (Topology-based alignment): the central performance claim requires that topological module matching produces functionally aligned latent bases across families. The manuscript does not report any post-alignment verification (activation correlations, task-vector cosine similarity, or representational similarity analysis) that would confirm the matched modules implement isomorphic computations rather than merely sharing topological roles. Without such evidence, the subsequent conflict-aware denoising step cannot be guaranteed to suppress misalignment-induced noise.

    Authors: We agree that direct post-alignment verification would provide stronger support for the claim that topological matching yields functionally aligned bases. In the revised manuscript, we have added representational similarity analysis (RSA) and activation correlation measurements between matched modules across families (Llama-Qwen and Llama-Mistral pairs). These results show substantially higher similarity for topologically matched modules compared to random or layer-index-based pairings, indicating that the alignment captures more than mere structural roles. We also include task-vector cosine similarities computed after alignment on held-out calibration data. These additions directly address the concern and support the role of conflict-aware denoising in mitigating residual misalignment. revision: yes

  2. Referee: [§4] §4 (Analytical justification): the derivation that preserving the target adapter basis guarantees well-conditioned transfer implicitly treats topological correspondence as semantic alignment. If attention or MLP blocks realize non-isomorphic functions across Llama vs. Qwen, the transferred deltas remain in misaligned coordinates; the conditioning argument then reduces to an assumption rather than a proven property. A concrete counter-example or sensitivity analysis under controlled basis rotation would strengthen this section.

    Authors: The analytical argument establishes well-conditioned transfer under the assumption that topological matching provides a meaningful correspondence, which is consistent with the cross-family empirical results. We acknowledge that the derivation does not independently prove semantic isomorphism. To address this, the revised manuscript now includes a controlled sensitivity analysis: we perturb the module matching by applying random basis rotations to simulate misalignment and demonstrate clear degradation in both conditioning metrics and downstream performance. This provides empirical evidence that the stability benefits depend on accurate topological correspondence rather than holding unconditionally. A full counter-example assuming completely non-isomorphic functions would require assumptions outside the paper's scope, but the added analysis strengthens the section as suggested. revision: yes
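The controlled basis-rotation analysis discussed in the second exchange can be illustrated on a toy update matrix. The sketch below rotates the coordinates of a random update by a known angle and measures how far the rotated copy drifts from the original. It is a simplified assumption-laden illustration: `block_rotation`, `rotation_sweep`, and the pairwise-rotation construction are hypothetical, and the sketch does not reproduce the authors' conditioning metrics or downstream evaluation.

```python
import numpy as np

def block_rotation(dim, theta):
    """Orthogonal matrix rotating each consecutive coordinate pair by theta radians."""
    assert dim % 2 == 0, "dim must be even for pairwise rotations"
    c, s = np.cos(theta), np.sin(theta)
    rot = np.zeros((dim, dim))
    for i in range(0, dim, 2):
        rot[i:i + 2, i:i + 2] = [[c, -s], [s, c]]
    return rot

def cosine(a, b):
    """Cosine similarity between two arrays under the Frobenius inner product."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rotation_sweep(delta, thetas):
    """Similarity between an update and its basis-rotated copy, for each angle."""
    dim = delta.shape[0]
    return [cosine(delta, block_rotation(dim, t) @ delta) for t in thetas]

rng = np.random.default_rng(0)
delta = rng.normal(size=(8, 4))           # toy update in the target's coordinates
thetas = np.linspace(0.0, np.pi / 2, 5)   # from aligned to fully misaligned
similarities = rotation_sweep(delta, thetas)
```

For this particular rotation family the similarity equals cos(θ), so it decays smoothly from 1 to 0 as misalignment grows. That gives a controlled knob for testing how quickly a transfer method degrades when the assumed correspondence is wrong.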

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent analytical support

Full rationale

The paper introduces HeteroFusion via two explicit components (topology-based module matching and conflict-aware denoising) plus an analytical argument for basis preservation. These are presented as design choices justified by the heterogeneous setting rather than derived from or equivalent to the experimental outcomes. No equations or claims reduce a prediction to a fitted parameter by construction, and no load-bearing step relies on a self-citation chain that itself assumes the target result. The central performance claims rest on cross-setting benchmarks against external baselines, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method description does not introduce new postulated objects or fitted constants.

pith-pipeline@v0.9.0 · 5762 in / 1036 out tokens · 41692 ms · 2026-05-21T10:52:33.511803+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

  1. [1] Mingyang Song and Mao Zheng. Model merging in the era of large language models: Methods, applications, and future directions. arXiv preprint arXiv:2603.09938, 2026.
  2. [2] Weiqin Li, Yi Peng, Mengzhou Zhang, Lei Ding, Han Hu, and Li Shen. Deep model fusion: A survey. arXiv preprint arXiv:2309.15698, 2023.
  3. [3] Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. arXiv preprint arXiv:2408.07666, 2024.
  4. [4] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In UAI, 2018.
  5. [5] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In NeurIPS, 2018.
  6. [6] Felix Draxler, Kambis Voss, Fred Hamprecht, and Ullrich Kothe. Essentially no barriers in neural network energy landscape. In NeurIPS, 2018.
  7. [7] Mitchell Wortsman, Gabriel Ilharco, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Benjamin Recht, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022.
  8. [8] Michael Matena and Colin Raffel. Merging models with fisher-weighted averaging. In NeurIPS, 2022.
  9. [9] Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In ICLR, 2023.
  10. [10] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. In NeurIPS, 2020.
  11. [11] Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In ICLR, 2023.
  12. [12] Keller Jordan, Hanie Sedghi, Oleg Saukh, Rickard Entezari, and Behnam Neyshabur. Repair: Renormalizing permuted activations for interpolation repair. arXiv preprint arXiv:2211.08403, 2022.
  13. [13] George Stoica, Daniel Bolya, Jens Bjorner, Pratyusha Ramesh, Taylor Hearn, and Judy Hoffman. Zipit! merging models from different tasks without training. In ICML, 2023.
  14. [14] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In ICLR, 2023.
  15. [15] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115, 2023.
  16. [16] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099, 2023.
  17. [17] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Model breadcrumbs: Scaling multi-task model merging with sparse masks. In ECCV, 2024.
  18. [18] Pala Tej Deep, Rishabh Bhardwaj, and Soujanya Poria. Della-merging: Reducing interference in model merging through magnitude-based sampling. arXiv preprint arXiv:2406.11617, 2024.
  19. [19] Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. Emr-merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems, 37:122741–122769, 2024.
  20. [20] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. In ICLR, 2024.
  21. [21] Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. Fusechat: Knowledge fusion of chat models. arXiv preprint arXiv:2408.07990, 2024.
  22. [22] Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, and Biqing Qi. Bohdi: Heterogeneous llm fusion with automatic data exploration. In NeurIPS, 2025.
  23. [23] Zehao Yan et al. Infifusion: A unified framework for enhanced cross-model reasoning via llm fusion. arXiv preprint arXiv:2501.02795, 2025.
  24. [24] Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, and Jing Li. Knowledge fusion of large language models via modular skillpacks. In ICLR, 2026.
  25. [25] Tianyi Feng, Jiaxuan Zhang, et al. Fusing llm capabilities with multi-llm log data. arXiv preprint arXiv:2507.10540, 2025.
  26. [26] Yao-Ching Yu, Chun-Chih Kuo, Ziqi Ye, Yu-Cheng Chang, and Yueh-Se Li. Breaking the ceiling of the llm community by treating token generation as a classification for ensembling. In Findings of EMNLP, 2024.
  27. [27] Costas Mavromatis, Petros Karypis, and George Karypis. Pack of llms: Model fusion at test-time via perplexity optimization. In COLM, 2024.
  28. [28] Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, and Linqi Song. Determine-then-ensemble: Necessity of top-k union for large language model ensembling. In ICLR, 2025.
  29. [29] Daehyeok Jang, Sangdoo Yun, and Dongyoon Han. Model stock: All we need is just a few fine-tuned models. arXiv preprint arXiv:2403.19522, 2024.
  30. [30] Enneng Yang, Zhi Wang, Li Shen, Shang Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In ICLR, 2024.
  31. [31] Enneng Yang, Li Shen, Zhi Wang, Guibing Guo, Xiaocong Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging. In ICML, 2024.
  32. [32] Zhaoyang Lu, Chengrun Fan, Wenhui Wei, Xiaoye Qu, Deli Chen, and Ying Cheng. Twin-merging: Dynamic integration of modular expertise in model merging. arXiv preprint arXiv:2406.15479, 2024.
  33. [33] Yu He, Yucheng Hu, Yuqi Lin, Tian Zhang, and Hai Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic. arXiv preprint arXiv:2408.13656, 2024.
  34. [34] Arash Hosseini Nobari, Kian Alimohammadi, Ali ArjomandBigdeli, Aditi Srivastava, Faez Ahmed, and Navid Azizan. Activation-informed merging of large language models. arXiv preprint arXiv:2502.02421, 2025. URL https://doi.org/10.48550/arXiv.2502.02421
  35. [35] Guangtai Du, Jaejun Lee, Jian Li, Ruochen Jiang, Yu Guo, Sihan Yu, Hongming Liu, Sinno Jialin Goh, Huan Tang, Dongmei He, and Min Zhang. Parameter competition balancing for model merging. arXiv preprint arXiv:2410.02396, 2024.
  36. [36] Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes. arXiv preprint arXiv:2403.13187, 2024.
  37. [37] Anke Tang, Li Shen, Yong Luo, Han Hu, Bo Du, and Dacheng Tao. Fusionbench: A comprehensive benchmark of deep model fusion. arXiv preprint arXiv:2406.03280, 2024.
  38. [38] Yu He, Siyao Zeng, Yucheng Hu, Ruichen Yang, Tian Zhang, and Hai Zhao. Mergebench: A benchmark for merging domain-specialized llms. arXiv preprint arXiv:2505.10833, 2025.
  39. [39] Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Junjie Ye, Qi Zhang, Tao Gui, Jihua Kang, Jingsheng Yang, Siyuan Li, and Chunsai Du. Instructuie: Multi-task instruction tuning for unified information extraction, 2023. URL https://arxiv.org/abs/2304.08085
  40. [40] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7
  41. [41] The Llama 3.1 Team. The Llama 3.1 Series of Models, 2024. URL https://arxiv.org/abs/2407.18342
  42. [42] Qwen Team, An Yang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025. URL https://arxiv.org/abs/2412.15115
  43. [43] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  44. [44] Aviv Navon, Aviv Shamsian, Idan Achituve, Ethan Fetaya, Gal Chechik, and Haggai Maron. Equivariant architectures for learning in deep weight spaces. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 25790–25816. PMLR, 2023.
  45. [45] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4651–4664. PMLR, 2021. URL https://proceedings.mlr.press/v139/jaegle21a.html
  46. [46] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
  47. [47] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, Online, 2021....
  48. [48] Ishan Deshpande, Ziyu Zhang, and Alexander G. Schwing. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3483–3491, June 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Deshpande_Generative_Modeling_Using_CVPR_2018_paper.html
  49. [49] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkL7n1-0b

Appendix excerpt (truncated in extraction): A. Theoretical and Empirical Analysis. We now explain why this design is stable and suitable for heterogeneous fusion. Conceptual Hypothesis. Al...