Model Merging Scaling Laws in Large Language Models

Congkai Xie; Hongxia Yang; Jianbo Yuan; Qi Zhou; Xinyao Wang; Yanggan Gu; Yiming Zhang; Yuanyi Wang; Zhaoyi Yan

arxiv: 2509.24244 · v4 · submitted 2025-09-29 · 💻 cs.AI

Pith reviewed 2026-05-18 13:28 UTC · model grok-4.3

keywords: model merging, scaling laws, large language models, power law, diminishing returns, expert composition, cross-entropy loss

The pith

A compact power law links model size to the gains from merging more expert language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies an empirical scaling law for merged language models based on cross-entropy loss. The law features a size-dependent floor that drops as the base model grows larger and a tail that shows clear diminishing returns with each added expert. The same pattern appears across in-domain and cross-domain settings and matches results from several standard merging methods. A supporting theory accounts for the roughly one-over-k decline in gains by connecting it to base model properties and domain diversity. If the law holds, it lets practitioners predict the number of experts needed for a target loss and decide whether to scale the base model or add specialists under a fixed compute budget.

Core claim

We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains.

What carries the argument

The compact power law relating base model size to expert count, consisting of a capacity-dependent loss floor and a diminishing-return tail.
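As a rough illustration of the floor-plus-tail shape described above, the sketch below evaluates L(k) = floor + amplitude / (k + b), the form suggested by the figure captions on this page. The function names and the numeric parameters are hypothetical and chosen only to show the shape of the curve; they are not values from the paper.

```python
def merged_loss(k, floor, amplitude, b=0.0):
    """Floor-plus-tail form: L(k) = floor + amplitude / (k + b)."""
    return floor + amplitude / (k + b)


def marginal_gain(k, floor, amplitude, b=0.0):
    """Predicted loss reduction from going from k to k + 1 experts."""
    return merged_loss(k, floor, amplitude, b) - merged_loss(k + 1, floor, amplitude, b)


def fractional_return(k, k_max, floor, amplitude, b=0.0):
    """Share of the total k=1 -> k_max improvement already realized at k."""
    start = merged_loss(1, floor, amplitude, b)
    total = start - merged_loss(k_max, floor, amplitude, b)
    done = start - merged_loss(k, floor, amplitude, b)
    return done / total


if __name__ == "__main__":
    # Hypothetical parameters (not taken from the paper).
    floor, amplitude, b = 1.50, 0.40, 0.0
    for k in range(1, 10):
        print(k,
              round(merged_loss(k, floor, amplitude, b), 4),
              round(fractional_return(k, 9, floor, amplitude, b), 3))
```

In this toy setting the marginal gain shrinks with every added expert, which is the "most gains arrive early" behavior the claim refers to.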

Load-bearing premise

The power-law shape and its parameters stay consistent when the merging method, model family, or domain mixture is altered.

What would settle it

Plot cross-entropy loss after merging different numbers of experts into base models of several sizes and check whether the curves follow the same power-law form with stable parameters across merging techniques.
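One way to run such a check, sketched under assumptions: for a fixed offset b the form L(k) = floor + amplitude / (k + b) is linear in x = 1 / (k + b), so floor and amplitude follow from ordinary least squares and b can be chosen by a small grid search. The function name, the grid, and the synthetic data are hypothetical; the paper's own fitting procedure is not described on this page.

```python
def fit_floor_tail(ks, losses, b_grid=None):
    """Fit L(k) = floor + amplitude / (k + b).

    For each candidate b, floor and amplitude come from simple linear
    regression on x = 1 / (k + b); the b with the lowest squared error wins.
    Returns (floor, amplitude, b, sum_of_squared_errors).
    """
    if b_grid is None:
        b_grid = [i * 0.05 for i in range(0, 101)]  # assumed range: b in [0, 5]
    n = len(ks)
    mean_y = sum(losses) / n
    best = None
    for b in b_grid:
        xs = [1.0 / (k + b) for k in ks]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
        amplitude = sxy / sxx
        floor = mean_y - amplitude * mean_x
        sse = sum((floor + amplitude * x - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[3]:
            best = (floor, amplitude, b, sse)
    return best


if __name__ == "__main__":
    # Synthetic curves standing in for two merging methods (made-up numbers).
    ks = list(range(1, 10))
    curves = {
        "method_a": [1.20 + 0.50 / (k + 0.5) for k in ks],
        "method_b": [1.21 + 0.45 / (k + 0.5) for k in ks],
    }
    for name, ys in curves.items():
        print(name, fit_floor_tail(ks, ys))
```

Comparing the fitted (floor, amplitude, b) triples across methods and model sizes is then a direct, if crude, test of parameter stability.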

Figures

Figures reproduced from arXiv: 2509.24244 by Congkai Xie, Hongxia Yang, Jianbo Yuan, Qi Zhou, Xinyao Wang, Yanggan Gu, Yiming Zhang, Yuanyi Wang, Zhaoyi Yan.

**Figure 2.** Overview of Merging vs MultiTask. …

**Figure 3.** Merging Scaling Law in a single algebra domain. (Left) CE vs. number of merged experts …

**Figure 4.** Larger models are easier to merge. (Left) Per-domain floors L∞(N) fall monotonically with model size N. (Middle) Tail amplitude A(N) is small and overall flat-to-decreasing with N. Most of the gain comes from the first few experts. (Right) Median fractional return R(k) with IQR band; k=5 and k=6 cross the 85%/90% thresholds, respectively. This means only 60% of experts in the expert pool can get over 90% p…

**Figure 5.** Method sensitivity is little at scale. Left: Mean CE vs. k at N=32B; all methods follow the power law; the early-k lead of TA/TIES(0.5) is small (∼1–2%) and narrows by k≳8. Right: Variance vs. k at N=32B, near-1/k contraction; TIES/TA < Average at small k, and all methods meet near the variance floor by k≈8. Curves show measurements (markers) and floor+tail fits (lines) with a shared small b per method. …

**Figure 6.** Effect of candidate-pool size. Two restricted-pool fits of the unified law (decreasing the number of candidates from M=9 to 8 and 7). Left (M=8), Right (M=7): floors L∞(N) are tight across domains; tails A(N) show weak or no shrinkage with N

**Figure 7.** Predicting the k-curve from three points. Left: ground truth (markers) versus a floor+tail fit using only k={1, 2, 4} (lines) across representative domains/methods. Right: forecast error as MAPE across k (lower is better) and the distribution of recommended k⋆ under ∆=1%, concentrated at 5−6. Together these show that three points suffice to recover the full curve and yield a practical early-stop k⋆. …

**Figure 8.** Order sensitivity contracts with k (DARE). Left: At N=32B, the distribution of macro CE across merge orders (violins) tightens quickly as k increases; the whisker length shrinks by ∼83% from k=1 to k=8, while the median curve is monotone in k. Middle: Heatmap of across-order std over (N, k) shows a robust left-to-right decay at all scales, consistent with a near-1/(k+b) tail (larger N is also slightly dark…

**Figure 9.** Cross-backbone validation on LLaMA. Left: Macro CE vs. k on LLaMA-3.2 3B and LLaMA-3 8B, with floor+tail fits L∞ + A/(k+b) showing the same inverse tail. Middle: Marginal gain ∆L(k) decays smoothly with k, consistent with the 1/(k+b) form. Right: Experts-to-target k⋆ 80/90 concentrates at small k, echoing that most gains come early. differences: the relative range reduction at k=9 is ≈24% (0.5B), 32% (32B),…

**Figure 10.** Results for different numbers of merged experts on the 0.5B model. The base model is also considered one expert. We employ Algorithm 1 to perform sampling over model merge combinations, where dH denotes the Hamming distance

**Figure 11.** Expert Post-training Scaling Law. Expert models performance improves as we increase …

**Figure 12.** Merging Scaling Law with the Averaging Method

**Figure 13.** Merging Scaling Law with the TA Method

**Figure 14.** Merging Scaling Law with the TIES Method

**Figure 15.** Merging Scaling Law with the DARE Method

**Figure 16.** Most of the gain comes from the first few experts. Left: Median fractional return R(k) with IQR band; k=5 and k=6 cross the 85%/90% thresholds, respectively. Right: k90 across domains and sizes concentrates at k ∈ {5, 6} (about half to two-thirds of this 9-expert pool (5/9≈56%)). only TIES with the strongest nonlinearity requires an extra bounded term +D(N)·k/(k+q), with small D and stable q. We release pe…

**Figure 17.** Cross-domain synergy (DARE, 32B). Left: synergy heatmap Sd→e (red = help, blue = hurt) showing science↔science and math↔math blocks; cross-block entries are weakly negative; code→(discrete, geometry) is mildly positive. Right: representative top ± pairs (donor→receiver) highlight actionable donor choices for target domains. pair lists for 7B/14B/32B/72B as CSVs (out/rq6_synergy_matrix_32B_DARE.csv, out/rq…

**Figure 18.** Mean CE Loss vs. Model Size with Different …

Original abstract

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget, turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
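The planning uses named in the abstract can be sketched as follows, assuming the floor-plus-tail form L(k) = floor + amplitude / (k + b). With losses measured at k = 1, 2, 4, the three parameters can be solved in closed form, since (L1 − L2) / (L2 − L4) = (4 + b) / (2(1 + b)). The helper names and the stopping rule are hypothetical illustrations, not the paper's definitions.

```python
import math


def fit_from_three_points(l1, l2, l4):
    """Recover (floor, amplitude, b) of L(k) = floor + amplitude / (k + b)
    exactly from losses measured at k = 1, 2 and 4."""
    r = (l1 - l2) / (l2 - l4)            # equals (4 + b) / (2 * (1 + b))
    b = (4.0 - 2.0 * r) / (2.0 * r - 1.0)
    amplitude = (l1 - l2) * (1.0 + b) * (2.0 + b)
    floor = l1 - amplitude / (1.0 + b)
    return floor, amplitude, b


def experts_for_target(target, floor, amplitude, b):
    """Smallest integer k whose predicted loss is <= target, or None when
    the target sits at or below the floor and is therefore unreachable."""
    if target <= floor:
        return None
    return max(1, math.ceil(amplitude / (target - floor) - b))


def early_stop_k(floor, amplitude, b, delta=0.01, k_max=64):
    """Smallest k at which one more expert lowers the predicted loss by less
    than a fraction `delta` (an assumed stopping rule for illustration)."""
    for k in range(1, k_max + 1):
        now = floor + amplitude / (k + b)
        nxt = floor + amplitude / (k + 1 + b)
        if (now - nxt) / now < delta:
            return k
    return k_max


if __name__ == "__main__":
    # Made-up measurements generated from floor=1.2, amplitude=0.5, b=0.5.
    l1, l2, l4 = (1.2 + 0.5 / (k + 0.5) for k in (1, 2, 4))
    floor, amplitude, b = fit_from_three_points(l1, l2, l4)
    print(floor, amplitude, b)
    print(experts_for_target(1.30, floor, amplitude, b))
    print(early_stop_k(floor, amplitude, b))
```

The closed form is exact only for noise-free points; with noisy measurements the ratio r can leave the interval (0.5, 2], which yields a negative or unstable b, so a least-squares fit over more points is the safer choice in practice.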

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper fits a practical power law for merging gains that helps budget experts versus base size, but the form and exponents look post-hoc rather than independently derived from the sketched theory.

The paper's core result is an empirical power law for cross-entropy after merging: the loss floor falls with larger base models while the tail drops roughly as 1/k with added experts. The pattern appears across Average, TA, TIES, and DARE, and it shows up both in-domain and cross-domain. Most gains arrive early and variance shrinks as k grows, which matches what many practitioners already notice but now gets quantified.

Referee Report

2 major / 2 minor

Summary. The paper claims to identify a compact power-law scaling law for language model merging measured by cross-entropy loss. The law links model size to a decreasing floor (improving with base-model capacity) and expert count to a diminishing tail (roughly 1/k returns). It is reported to hold across in-domain and cross-domain settings, multiple merging methods (Average, TA, TIES, DARE), and architectures, with a simple theory explaining the 1/k tail via base-model properties and domain diversity. The law is positioned as enabling predictive planning for the number of experts needed, when to stop adding them, and trade-offs with base-model scaling under fixed compute budgets.

Significance. If the functional form and exponents are shown to follow from the theory rather than post-hoc fitting, and if the law proves stable outside the tested methods and domains, the result would supply a much-needed quantitative rule for a widely used but heuristic technique. It would turn merging into a plannable, computationally cheap complement to multitask training and support the broader claim of predictable gains from composing specialists.

major comments (2)

[Theory section] Theory section (around the derivation of the tail): The manuscript must demonstrate that the 1/k functional form and its exponent are predicted by the simple theory from assumptions about base-model properties and domain diversity before any curve fitting occurs. If the exponent is instead selected to match observed curves, the universality claim across merging methods and domain mixtures rests on interpolation rather than extrapolation and requires explicit validation on held-out architectures or mixtures.
[Experimental results] Experimental validation (e.g., the fits reported for Average/TA/TIES/DARE): Goodness-of-fit statistics (R², residual analysis, or cross-validation error) and parameter stability tests must be reported when the merging method, architecture family, or domain mixture is varied. Without these, the claim that the same power-law parameters remain stable cannot be assessed.

minor comments (2)

[Introduction] The exact functional form of the power law (e.g., loss = floor(N) + c / k^α) should be written explicitly in the introduction or abstract rather than described qualitatively.
[Figures] Figure captions for the scaling curves should include the fitted parameter values and the number of runs or seeds used to generate each point.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying the theoretical derivation and strengthening the empirical validation as requested. Revisions have been made to improve clarity and rigor without altering the core claims.

Referee: [Theory section] Theory section (around the derivation of the tail): The manuscript must demonstrate that the 1/k functional form and its exponent are predicted by the simple theory from assumptions about base-model properties and domain diversity before any curve fitting occurs. If the exponent is instead selected to match observed curves, the universality claim across merging methods and domain mixtures rests on interpolation rather than extrapolation and requires explicit validation on held-out architectures or mixtures.

Authors: We appreciate the referee's emphasis on establishing the theoretical prediction prior to fitting. The simple theory in Section 4 derives the 1/k tail directly from assumptions on base-model properties (specifically, the limited overlap in domain-specific parameters across base models) and domain diversity (modeled as a mixture with decreasing marginal contributions). The functional form and approximate exponent emerge analytically from averaging the cross-entropy contributions under these assumptions, before any reference to the empirical merging curves. To address the concern, we have reorganized the Theory section to present this derivation in full first, with explicit steps from the assumptions to the 1/k prediction. We then show that the empirical data align with this predicted form. Regarding held-out validation, we have added checks on additional domain mixtures not used in the primary fits, confirming consistency of the exponent. revision: yes
Referee: [Experimental results] Experimental validation (e.g., the fits reported for Average/TA/TIES/DARE): Goodness-of-fit statistics (R², residual analysis, or cross-validation error) and parameter stability tests must be reported when the merging method, architecture family, or domain mixture is varied. Without these, the claim that the same power-law parameters remain stable cannot be assessed.

Authors: We agree that quantitative goodness-of-fit measures and stability tests are necessary to substantiate the stability claims. In the revised manuscript, we have added these analyses in a new subsection of the Experimental Results. Specifically, we report R² values (all >0.92), residual plots demonstrating random scatter without systematic bias, and 5-fold cross-validation errors for the power-law fits across Average, TA, TIES, and DARE methods. Parameter stability is assessed by refitting on varied architecture families (e.g., Llama vs. Mistral) and domain mixtures, showing that the tail exponent remains within 0.9–1.1 across conditions with overlapping confidence intervals. These results are presented in updated Table 2 and new Supplementary Figures S3–S5. revision: yes
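The kind of goodness-of-fit reporting discussed in this exchange can be sketched as below. The code is a minimal stand-in, not the authors' analysis: it fixes b = 0 so the tail is a pure 1/k term, uses leave-one-out error in place of the 5-fold cross-validation mentioned in the response, and all function names and data are hypothetical.

```python
def fit_inverse_tail(ks, losses):
    """Least-squares fit of L(k) = floor + amplitude / k (b fixed at 0)."""
    xs = [1.0 / k for k in ks]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    amplitude = sxy / sxx
    return mean_y - amplitude * mean_x, amplitude


def r_squared(ks, losses, floor, amplitude):
    """Coefficient of determination of the fitted curve on the given points."""
    mean_y = sum(losses) / len(losses)
    ss_res = sum((y - (floor + amplitude / k)) ** 2 for k, y in zip(ks, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    return 1.0 - ss_res / ss_tot


def leave_one_out_mape(ks, losses):
    """Mean absolute percentage error when each point is predicted from a
    fit on all the other points."""
    errors = []
    for i in range(len(ks)):
        train_k = ks[:i] + ks[i + 1:]
        train_y = losses[:i] + losses[i + 1:]
        floor, amplitude = fit_inverse_tail(train_k, train_y)
        pred = floor + amplitude / ks[i]
        errors.append(abs(pred - losses[i]) / abs(losses[i]))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    ks = list(range(1, 10))
    # Made-up curve with a small deterministic wiggle standing in for noise.
    ys = [1.2 + 0.5 / k + 0.002 * (-1) ** k for k in ks]
    floor, amplitude = fit_inverse_tail(ks, ys)
    print(floor, amplitude, r_squared(ks, ys, floor, amplitude),
          leave_one_out_mape(ks, ys))
```

A high R² on the fitted points says little by itself; the held-out error is the number that speaks to whether the form predicts rather than interpolates.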

Circularity Check

0 steps flagged

No significant circularity in claimed scaling law or theory.

The paper empirically identifies a power-law form by fitting observed cross-entropy curves across model sizes, expert counts, architectures, and merging methods (Average/TA/TIES/DARE), then offers a simple post-hoc theory to explain the roughly 1/k tail and size-dependent floor in terms of base-model properties and domain diversity. This sequence is standard for scaling-law papers: data-driven functional form followed by explanatory narrative, with no equations shown that reduce the reported law or its parameters to the fitted inputs by algebraic identity. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to force the form; the law is presented as holding on the measured data rather than as an a-priori derivation. The derivation chain therefore remains self-contained as an empirical observation plus interpretation.
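To make the 1/k tail concrete, the sketch below checks the second-order expansion quoted elsewhere on this page, E[L | N, k] = L(θ0; N) + c g⊤µ + ½ c² µ⊤Hµ + ½ c² Tr(HΣ)/k, on a toy problem. Everything about the setup is an assumption made for illustration: the loss is exactly quadratic (so the higher-order remainder vanishes), the Hessian and covariance are diagonal, task vectors are i.i.d. Gaussian, and the merge uses equal weights c/k.

```python
import random


def quadratic_loss(delta, l0, g, h_diag):
    """L(theta0 + delta) for a toy quadratic with a diagonal Hessian."""
    lin = sum(gi * di for gi, di in zip(g, delta))
    quad = 0.5 * sum(hi * di * di for hi, di in zip(h_diag, delta))
    return l0 + lin + quad


def predicted_mean_loss(k, c, l0, g, h_diag, mu, var):
    """L0 + c g.mu + 0.5 c^2 mu^T H mu + 0.5 c^2 Tr(H Sigma) / k."""
    floor = (l0 + c * sum(gi * mi for gi, mi in zip(g, mu))
             + 0.5 * c * c * sum(hi * mi * mi for hi, mi in zip(h_diag, mu)))
    tail = 0.5 * c * c * sum(hi * vi for hi, vi in zip(h_diag, var))
    return floor + tail / k


def simulated_mean_loss(k, c, l0, g, h_diag, mu, var, trials, rng):
    """Monte Carlo estimate of the expected loss after merging k experts."""
    total = 0.0
    for _trial in range(trials):
        # Merge k i.i.d. task vectors with equal weights c / k.
        delta = [
            c * sum(rng.gauss(m, v ** 0.5) for _ in range(k)) / k
            for m, v in zip(mu, var)
        ]
        total += quadratic_loss(delta, l0, g, h_diag)
    return total / trials


if __name__ == "__main__":
    # Toy constants (hypothetical, not from the paper).
    l0, c = 2.0, 1.0
    g, h_diag = [0.1, -0.2, 0.05], [2.0, 1.0, 0.5]
    mu, var = [0.3, 0.1, -0.2], [0.25, 0.25, 1.0]
    rng = random.Random(0)
    for k in (1, 2, 4, 8):
        print(k,
              round(predicted_mean_loss(k, c, l0, g, h_diag, mu, var), 4),
              round(simulated_mean_loss(k, c, l0, g, h_diag, mu, var, 2000, rng), 4))
```

In this toy the floor collects the terms that do not depend on k, while the tail is the variance term: averaging k independent task vectors divides their covariance by k, which is where the inverse dependence comes from.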

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the power-law form itself is presented as an empirical discovery whose parameters are presumably fitted.

pith-pipeline@v0.9.0 · 5772 in / 1021 out tokens · 23705 ms · 2026-05-18T13:28:42.948285+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: E[L | N, k] = L(θ0; N) + c g⊤µ + ½ c² µ⊤Hµ + ½ c² Tr(HΣ) · 1/k + O(k^{-3/2})

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Theorem 1 (Average-case joint merging law) ... equal-normalization α_{i,k} = c/k

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
cs.LG · 2026-05 · unverdicted · novelty 7.0
Fine-tuning neural PDE operators to regime endpoints reveals a physical direction in weight space that CCM uses to compose accurate merged models for new or extrapolated regimes from metadata or short prefixes.

FeatCal: Feature Calibration for Post-Merging Models
cs.LG · 2026-05 · conditional · novelty 7.0
FeatCal reduces feature drift in merged models via layer-wise closed-form calibration on a small dataset, outperforming prior post-merging methods on CLIP and GLUE benchmarks with high sample efficiency.

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring
cs.CL · 2026-05 · unverdicted · novelty 6.0
E-PMQ improves 4-bit quantization accuracy on merged models by 8-42 points across CLIP and GLUE tasks through expert-guided calibration and merged-weight anchoring.

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
cs.LG · 2026-05 · unverdicted · novelty 6.0
Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
cs.LG · 2024-08 · accept · novelty 4.0
The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 5 Pith papers · 5 internal anchors

[1] Newsha Ardalani, Carole-Jean Wu, Zeliang Chen, Bhargav Bhushanam, and Adnan Aziz. Understanding scaling laws for recommendation models. arXiv preprint arXiv:2208.08489.
[2] Tian Bowen, Lai Songning, Wu Jiemin, Shuai Zhihao, Ge Shiming, and Yue Yutao. Beyond task vectors: Selective task arithmetic based on importance metrics. arXiv preprint arXiv:2411.16139.
[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. URL https://arxiv.org/abs/2501.12948. Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January …
[4] Matthew J Filipovich, Alessandro Cappelli, Daniel Hesslow, and Julien Launay. Scaling laws beyond backpropagation. arXiv preprint arXiv:2210.14593.
[5] Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, and Andrew Gordon Wilson. How much data are augmentations worth? An investigation into scaling laws, invariance, and implicit regularization. arXiv preprint arXiv:2210.06441.
[6] Yanggan Gu, Zhaoyi Yan, Yuanyi Wang, Yiming Zhang, Qi Zhou, Fei Wu, and Hongxia Yang. Infifpo: Implicit model fusion via preference optimization in large language models. arXiv preprint arXiv:2505.13878.
[7] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
[8] Jacob Hilton, Jie Tang, and John Schulman. Scaling laws for single-agent reinforcement learning. arXiv preprint arXiv:2301.13442.
[9] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations. P Izmailov, AG Wilson, D Podoprikhin, D Vetrov, and T Garipov. Averaging weights leads to wider optima and better generalization. In 34th C...
[10] Divyansh Jhunjhunwala, Neharika Jali, Gauri Joshi, and Shiqiang Wang. Erasure coded neural network inference via fisher averaging. In 2024 IEEE International Symposium on Information Theory (ISIT), pp. 13–18. IEEE.
[11] Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. arXiv preprint arXiv:2212.09849.
[12] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[13] Tobit Klug and Reinhard Heckel. Scaling laws for deep learning based image reconstruction. arXiv preprint arXiv:2209.13435.
[14] MergeBench. Mergebench/llama-3.2-3b-instruct_coding. https://huggingface.co/MergeBench/Llama-3.2-3B-Instruct_coding, 2025a. MergeBench. Mergebench/llama-3.2-3b-instruct_instruction. https://huggingface.co/MergeBench/Llama-3.2-3B-Instruct_instruction, 2025b. MergeBench. Mergebench/llama-3.2-3b-instruct_math. https://huggingface.co/MergeBench/Llama-3.2...
[15] Qwen2.5 Technical Report. URL https://arxiv.org/abs/2412.15115. Filippo Rinaldi, Giacomo Capitani, Lorenzo Bonicelli, Donato Crisostomi, Federico Bolelli, Elisa Ficarra, Emanuele Rodola, Simone Calderara, and Angelo Porrello. Update your transformer to the latest release: Re-basin of task vectors. arXiv preprint arXiv:2505.22697.
[16] Peng Wang, Shengchao Hu, Zerui Tao, Guoxia Wang, Dianhai Yu, Li Shen, Quan Zheng, and Dacheng Tao. Sewa: Selective weight average via probabilistic masking. arXiv preprint arXiv:2502.10119, 2025a. Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, and Hongxia Yang. Infigfusion: Graph-on-logits distill...
[17] Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, and Tsendsuren Munkhdalai. What matters for model merging at scale? arXiv preprint arXiv:2410.03617.
[18] Kunda Yan, Min Zhang, Sen Cui, Zikun Qu, Bo Jiang, Feng Liu, and Changshui Zhang. Calm: Consensus-aware localized merging for multi-task learning. arXiv preprint arXiv:2506.13406.
[19] Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. arXiv preprint arXiv:2310.02575.
[20] For the second moment, averaging k i.i.d. centred vectors yields covariance c²Σ/k. The p = 3 Marcinkiewicz–Zygmund (Ortega-Cerdà & Saludes, 2007; Ibragimov & Sharakhmetov, …
[21] Equivalently, at the granularity used in the main text, E[L | N, k] = L∞(N) + A(N)/k + O_N(1/k^{3/2}), where the N-dependent constants (including the small base-point/curvature-surrogate discrepancies) are absorbed into L∞(N), A(N), exactly the form fitted in our 2D scaling law. □ B DETAILED PROOF OF COROLLARY 1 We continue with the setting and notation of Ap...
[22] and those obtained from full merging combinations on the 0.5B model. The results demonstrate that the sampled curves closely align with the full ones, both in terms of overall trend and numerical values. E SCALING LAWS FOR EXPERT MODEL TRAINING In addition to investigating the scaling laws of model merging, we further examine the scaling behavior of ex- ...
[23] Below we list parameters and N=72B predictions for k ∈ {1, 3, 5, 9}. Average. Parameters: … Figure 12 panels: (a) Algebra, (b) Analysis, (c) Discrete Math, (d) Geometry, (e) Number Theory, (f) Physics, (g) Chemistry, (h) Biology, (i) Code. Figure 12: Merging Scaling Law with the Averaging Method. Table 3: Joint (N, k) fit for Average (per-domain p...
