pith. machine review for the scientific record.

arxiv: 2509.24244 · v4 · submitted 2025-09-29 · 💻 cs.AI

Model Merging Scaling Laws in Large Language Models

keywords: model, experts, merging, scaling, gains, across, adding, base

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and the number of experts: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget. This turns merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
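The planning use mentioned in the abstract (estimating how many experts are needed to reach a target loss) can be illustrated with a small sketch. The abstract does not give the law's exact functional form, so the code below assumes the simplest shape consistent with its description: at a fixed model size, loss ≈ floor + tail / k, where k is the number of merged experts. The function names, the synthetic measurements, and the omission of the model-size dependence are all assumptions made for illustration, not the paper's method.

```python
import math

def fit_floor_and_tail(ks, losses):
    """Ordinary least-squares fit of loss ~= floor + tail / k.

    ASSUMPTION: the 1/k form is an illustrative reading of the abstract,
    not the paper's stated formula. Needs at least two distinct k values.
    """
    xs = [1.0 / k for k in ks]          # regress loss on 1/k
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    tail = sxy / sxx                    # slope: size of the merging tail
    floor = mean_y - tail * mean_x      # intercept: loss as k grows large
    return floor, tail

def experts_needed(floor, tail, target_loss):
    """Smallest integer k with floor + tail / k <= target_loss.

    Returns None when the target is at or below the fitted floor, since
    no number of experts reaches it under the assumed form.
    """
    if target_loss <= floor:
        return None
    if tail <= 0:
        return 1
    return max(1, math.ceil(tail / (target_loss - floor)))

# Synthetic, made-up measurements (hypothetical, for demonstration only).
ks = [1, 2, 4, 8]
losses = [2.5, 2.25, 2.125, 2.0625]

floor, tail = fit_floor_and_tail(ks, losses)
print("fitted floor:", round(floor, 4), "fitted tail:", round(tail, 4))
print("experts needed for loss 2.12:", experts_needed(floor, tail, 2.12))
```

Under the same assumed form, the gain from adding one more expert is tail / (k (k + 1)), which shrinks quickly with k. This is one way to read the abstract's remark that most gains arrive early, and it suggests a simple stopping rule: stop once that marginal gain falls below a chosen threshold.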

This paper has not been read by Pith yet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Fine-tuning neural PDE operators to regime endpoints reveals a physical direction in weight space that CCM uses to compose accurate merged models for new or extrapolated regimes from metadata or short prefixes.

  2. FeatCal: Feature Calibration for Post-Merging Models

    cs.LG 2026-05 conditional novelty 7.0

    FeatCal reduces feature drift in merged models via layer-wise closed-form calibration on a small dataset, outperforming prior post-merging methods on CLIP and GLUE benchmarks with high sample efficiency.

  3. Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.