Model Merging Scaling Laws in Large Language Models

· 2025 · cs.AI · arXiv 2509.24244

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget. This turns merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
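
The planning use described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact functional form, so the code assumes a simplified curve at fixed model size, loss(k) = floor + tail / k, which matches only the stated "gains fall roughly as 1/k" behaviour. The function names `fit_floor_and_tail` and `experts_needed` and all numbers are hypothetical, and the loss values are synthetic, not results from the paper.

```python
import math


def fit_floor_and_tail(ks, losses):
    """Least-squares fit of loss = floor + tail / k (linear in x = 1/k).

    ASSUMPTION: this 1/k form is a simplification for illustration,
    not the fitted law reported by the paper.
    """
    xs = [1.0 / k for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    tail = sxy / sxx
    floor = mean_y - tail * mean_x
    return floor, tail


def experts_needed(floor, tail, target_loss):
    """Smallest k with floor + tail / k <= target_loss, or None if unreachable."""
    if target_loss <= floor:
        return None  # the target lies at or below the asymptotic floor
    # The small epsilon guards against floating-point overshoot before ceil.
    return max(1, math.ceil(tail / (target_loss - floor) - 1e-9))


# Synthetic cross-entropy values at k = 1, 2, 4, 8 merged experts (made up).
ks = [1, 2, 4, 8]
losses = [2.5, 2.25, 2.125, 2.0625]

floor, tail = fit_floor_and_tail(ks, losses)
print(f"floor={floor:.3f} tail={tail:.3f}")
print("experts for loss 2.1:", experts_needed(floor, tail, 2.1))
```

Under this assumed form, a target below the fitted floor cannot be reached by adding experts at all, which corresponds to the case where scaling the base model is the only remaining option.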

citation-role summary

background: 1 · other: 1

citation-polarity summary

background: 1 · unclear: 1

representative citing papers

Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Fine-tuning neural PDE operators to regime endpoints reveals a physical direction in weight space that CCM uses to compose accurate merged models for new or extrapolated regimes from metadata or short prefixes.

FeatCal: Feature Calibration for Post-Merging Models

cs.LG · 2026-05-13 · conditional · novelty 7.0

FeatCal reduces feature drift in merged models via layer-wise closed-form calibration on a small dataset, outperforming prior post-merging methods on CLIP and GLUE benchmarks with high sample efficiency.

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

cs.CL · 2026-05-16 · unverdicted · novelty 6.0

E-PMQ improves 4-bit quantization accuracy on merged models by 8-42 points across CLIP and GLUE tasks through expert-guided calibration and merged-weight anchoring.

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

cs.LG · 2024-08-14 · accept · novelty 4.0

The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.
