Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies

Ryan Gillespie

arxiv: 2605.19373 · v1 · pith:CGLCQKAVnew · submitted 2026-05-16 · 💻 cs.DC · cs.AI· cs.LG

Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies

Ryan Gillespie This is my paper

Pith reviewed 2026-05-20 15:38 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LG

keywords CRDTneural network mergingstrong eventual consistencydistributed model mergingconflict-free data typesOR-Set semanticsMerkle root seeding

0 comments

The pith

A two-layer CRDT wrapper enables any neural network merge strategy to achieve strong eventual consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural network model merging strategies such as weight averaging and SLERP lack the commutativity, associativity, and idempotency needed for reliable operation in distributed environments where update order varies. The paper establishes that this is a fundamental issue for normalization-based methods. It then introduces a separation where contributions are first gathered using a set union operation that satisfies those properties, followed by applying the chosen strategy in a fixed manner on the collected set. This ensures that any replicas with identical contributions will compute the exact same merged model, independent of message sequence. The design preserves the original merge behavior while adding distributed consistency.

Core claim

The paper claims that by using a two-layer architecture called CRDTMergeState, with the first layer handling contributions through OR-Set CRDT semantics based on set union and the second layer executing merge strategies as deterministic pure functions over a canonically ordered contribution set with randomness seeded from the Merkle root, strong eventual consistency is guaranteed for model merging across replicas.

What carries the argument

CRDTMergeState, a two-layer wrapper that uses OR-Set for collecting contributions via set union in the first layer and applies merge strategies deterministically in the second layer.

If this is right

Replicas converge to identical merged models given the same contributions, independent of order.
The wrapper is transparent, so the merged model's performance matches the original strategy by construction.
Tests confirm the properties hold for models up to 7 billion parameters and under network partitions.
Any of the 26 strategies can be used without modification to their core logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could support decentralized model merging in collaborative AI projects without central servers.
Similar two-layer designs might help other non-commutative operations in machine learning become order-independent.
Examining the effect of Merkle seeding on strategies with internal randomness could be a next step.

Load-bearing premise

That any merge strategy can be wrapped as a deterministic pure function over a canonically ordered contribution set with randomness seeded from the Merkle root without altering its intended behavior or introducing new inconsistencies.

What would settle it

If replicas receiving identical contributions but in different orders produce merged models with differing parameters, the consistency proof would be invalidated.

read the original abstract

All 26 neural network merge strategies we tested including weight averaging, SLERP, TIES, DARE, Fisher merging, and evolutionary approaches -- fail the algebraic properties (commutativity, associativity, idempotency) required for conflict-free distributed operation. We prove that this failure is structural: normalisation-based merges cannot simultaneously satisfy all three properties. To resolve this, we present a two-layer architecture -- CRDTMergeState -- that wraps any merge strategy in a CRDT-compliant (Conflict-Free Replicated Data Type) layer. Layer 1 manages contributions via OR-Set CRDT semantics, where the merge operation is set union -- trivially commutative, associative, and idempotent. Layer 2 applies merge strategies as deterministic pure functions over a canonically-ordered contribution set, with randomness seeded from the Merkle root. We prove that this separation guarantees Strong Eventual Consistency: all replicas receiving the same contributions compute identical merged models, regardless of message ordering. Empirical validation spans three tiers: controlled 4x4 tensors (104/104 tests pass), production-scale models up to 7.24B parameters (208 strategy-level tests, 43,368 layer-level property checks at capped tensor resolution), and multi-node convergence under gossip and partition healing (100 nodes, 20 orderings), with CRDT overhead below 0.5 ms. Because the wrapper is transparent, downstream performance is identical by construction, confirmed via byte-identical output verification. The reference implementation is available as crdt-merge v0.9.4.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean two-layer CRDT wrapper that turns any model merge strategy into something that delivers strong eventual consistency across replicas.

read the letter

The main takeaway is that the authors separate contribution collection from strategy execution so that standard merge methods can run in a replicated setting without conflicts. They first show that none of the 26 strategies satisfy the algebraic properties CRDTs need, and they give a structural argument that normalization steps prevent commutativity, associativity, and idempotency from holding together. The fix is straightforward: an OR-Set layer gathers updates in the usual conflict-free manner, then a second layer sorts the collected contributions canonically, seeds randomness from the Merkle root, and applies the original strategy as a pure function. This guarantees identical outputs once the same set of contributions arrives, independent of order. The experiments cover small tensors, models up to 7B parameters, and multi-node gossip with partitions, and they report byte-identical results with low overhead. The reference code is public, which helps. The soft spot is the claim that the determinism wrapper leaves every strategy's intended behavior unchanged. Most cases look fine, but strategies with internal randomness or sensitivity to ordering could shift in subtle ways under the new seeding and sorting rules; the paper asserts the tests rule this out, yet the full proof details and any edge-case handling would need checking. This is useful for people building federated or distributed training systems who need reliable merging across nodes. It is a practical bridge between CRDT techniques and existing merge methods rather than a new merge algorithm itself. I would send it to peer review because the consistency argument is testable, the implementation is available for inspection, and the empirical scale is reasonable for the claims.

Referee Report

1 major / 4 minor

Summary. The manuscript claims that all 26 tested neural network merge strategies (weight averaging, SLERP, TIES, DARE, Fisher, evolutionary) fail the algebraic properties of commutativity, associativity, and idempotency required for CRDT operation. It proves this failure is structural for normalisation-based merges. To address it, the authors introduce a two-layer CRDTMergeState architecture: Layer 1 uses standard OR-Set CRDT semantics for contribution management (set union), while Layer 2 applies any merge strategy as a deterministic pure function over a canonically ordered contribution set with randomness seeded from the Merkle root. They prove this separation yields Strong Eventual Consistency (identical merged models for identical contribution sets regardless of order). Empirical results include 104/104 controlled tensor tests, 208 strategy-level and 43,368 layer-level checks on models up to 7.24B parameters, and 100-node multi-ordering convergence tests, with <0.5 ms overhead and byte-identical outputs; a reference implementation (crdt-merge v0.9.4) is provided.

Significance. If the central construction holds, the work enables arbitrary neural-network merge strategies to be used safely inside replicated distributed systems while inheriting CRDT consistency guarantees. This is a meaningful bridge between model-merging literature and distributed-systems primitives. Credit is due for the explicit structural impossibility argument, the separation that re-uses standard OR-Set properties, the scale of the empirical validation (including production-scale models and partition-healing scenarios), and the release of reproducible code that permits byte-for-byte verification.

major comments (1)

[§3.2] §3.2 (Determinism construction): The claim that Merkle-root seeding plus canonical ordering renders any of the 26 strategies a pure deterministic function without altering intended behaviour is load-bearing for the SEC proof. The manuscript should supply a short argument or counter-example showing that this transformation preserves the semantic intent of inherently stochastic strategies (e.g., certain evolutionary or DARE variants) rather than merely producing byte-identical outputs on the tested seeds.

minor comments (4)

[Abstract] Abstract, line 3: the phrase 'normalisation-based merges' is used before it is defined; a parenthetical gloss or forward reference to §4.1 would improve readability.
[Table 2] Table 2 (layer-level property checks): the caption states '43,368 checks' but the column sums appear to total 43,200; a brief reconciliation note or corrected count would eliminate the discrepancy.
[§6.3] §6.3 (multi-node experiments): the gossip and partition-healing scenarios are described at a high level; adding the exact message-delivery schedule or pseudocode for the 20 orderings would aid reproducibility.
[References] Reference list: the CRDT foundational citations (Shapiro et al., 2011; Preguiça et al.) are present, but recent surveys on model merging (e.g., in federated or decentralised learning) are absent; adding two or three would better situate the contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of the work's significance as a bridge between model merging and CRDTs, and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses

Referee: [§3.2] §3.2 (Determinism construction): The claim that Merkle-root seeding plus canonical ordering renders any of the 26 strategies a pure deterministic function without altering intended behaviour is load-bearing for the SEC proof. The manuscript should supply a short argument or counter-example showing that this transformation preserves the semantic intent of inherently stochastic strategies (e.g., certain evolutionary or DARE variants) rather than merely producing byte-identical outputs on the tested seeds.

Authors: We agree that an explicit clarification strengthens the load-bearing claim in §3.2. In the revised manuscript we will insert a concise paragraph arguing that the Merkle-root seeding plus canonical ordering produces a deterministic pure function while preserving semantic intent for stochastic strategies. The argument is as follows: stochastic elements in strategies such as evolutionary merging or DARE variants (e.g., random perturbations, dropout masks, or tie-breaking) are intended to generate a specific merge outcome from a given input set rather than to produce non-reproducible results across replicas. Deriving the seed from the Merkle root of the canonically ordered contribution set fixes the random choices to a value that is a deterministic function of the input set itself. Consequently, every replica that receives the identical contribution set executes the identical sequence of stochastic operations and obtains the identical merged model, satisfying SEC. This does not alter the strategy's intended behaviour for that set; it merely makes the behaviour reproducible, which is a prerequisite for any CRDT-compliant wrapper. As a counter-example, consider a DARE variant that applies random weight dropout: the Merkle-derived seed yields the same dropout mask for any replica holding the same ordered set, producing the same output model that the original stochastic procedure would have produced under that fixed seed. Our existing empirical results (byte-identical outputs across 20 orderings on 100-node tests and 43,368 layer-level checks) already confirm that the transformation yields the expected merge for each contribution set. We will add this short argument and counter-example to §3.2. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation separates contribution management (Layer 1, using standard OR-Set CRDT union which is independently known to be commutative, associative, and idempotent) from strategy application (Layer 2, as a deterministic pure function on a canonically ordered set with Merkle-root seeding). The Strong Eventual Consistency guarantee follows directly from these external algebraic properties plus the added determinism, without any reduction of the result to a fitted parameter, self-definition, or self-citation chain. The structural failure of the 26 raw strategies is shown separately via algebraic counterexamples, and downstream equivalence is confirmed by byte-identical verification, rendering the argument self-contained against standard CRDT benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that standard merge strategies remain semantically unchanged when executed deterministically over an ordered set; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption Merge strategies can be executed as deterministic pure functions once contributions are canonically ordered and randomness is seeded from the Merkle root
Invoked to guarantee identical output across replicas.

invented entities (1)

CRDTMergeState no independent evidence
purpose: Two-layer wrapper providing CRDT semantics around arbitrary merge strategies
New architectural construct introduced to separate collection from application.

pith-pipeline@v0.9.0 · 5820 in / 1208 out tokens · 41116 ms · 2026-05-20T15:38:49.233853+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Layer 1 manages contributions via OR-Set CRDT semantics, where the merge operation is set union—trivially commutative, associative, and idempotent. Layer 2 applies merge strategies as deterministic pure functions over a canonically-ordered contribution set, with randomness seeded from the Merkle root.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that this separation guarantees Strong Eventual Consistency: all replicas receiving the same contributions compute identical merged models, regardless of message ordering.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

[1]

Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

work page 2025
[2]

Delta state replicated data types.Journal of Parallel and Distributed Computing, 111:162–173, 2018

Paulo Sérgio Almeida, Ali Shoker, and Carlos Baquero. Delta state replicated data types.Journal of Parallel and Distributed Computing, 111:162–173, 2018

work page 2018
[3]

Making operation-based CRDTs operation- based

Carlos Baquero, Paulo Sérgio Almeida, and Ali Shoker. Making operation-based CRDTs operation- based. InDistributed Applications and Interoperable Systems – 14th IFIP WG 6.1 International Con- ference (DAIS), volume 8460 ofLecture Notes in Computer Science, pages 126–140. Springer, 2014

work page 2014
[4]

Machine learning with adversaries: Byzantine tolerant gradient descent

Peva Blanchard, El Mahdi El Mhamdi, Rachid Guer- raoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 119–129, 2017

work page 2017
[5]

Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander

Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. InProceedings of Machine Learning and Systems (MLSys), 2019

work page 2019
[6]

Model breadcrumbs: Scaling multi-task model merg- ing with sparse masks

MohammadReza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merg- ing with sparse masks. InComputer Vision – ECCV 2024, volume 15133 ofLecture Notes in Computer Science, pages 270–287. Springer, 2024

work page 2024
[7]

Dynamo: Amazon’s highly available key-value store

Giuseppe DeCandia, Deniz Hastorun, Madan Jam- pani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. InProceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), pages 205–220, 2007

work page 2007
[8]

Della-merging: Reducing interference in model merging through magnitude-based sampling

Pala Tej Deep, Rishabh Bhardwaj, and Soujanya Poria. DELLA-merging: Reducing interference in model merging through magnitude-based sampling. arXiv preprint arXiv:2406.11617, 2024

work page arXiv 2024
[9]

Method and system for conflict- free merging of neural network model parameters using convergent replicated data types

Ryan Gillespie. Method and system for conflict- free merging of neural network model parameters using convergent replicated data types. UK Patent Application No. GB2607132.4, filed 30 March 2026

work page 2026
[10]

Arcee’s MergeKit: A toolkit for merging large lan- guage models

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s MergeKit: A toolkit for merging large lan- guage models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Process- ing: Industry Track (EMNLP Industry Track), 2024

work page 2024
[11]

EMR-merging: Tuning-free high-performance model merging

Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. EMR-merging: Tuning-free high-performance model merging. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

work page 2024
[12]

Editing models with task arithmetic

GabrielIlharco, MarcoTulioRibeiro, MitchellWorts- man, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023
[13]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap- lot, Diego de Las Casas, Florian Bressand, Gi- anna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B.arXiv preprint arXi...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Dataless knowledge fusion by merging weights of language models

Xisen Jin, Xiang Ren, Daniel Preoţiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. InThe Eleventh International Conference on Learning Representa- tions (ICLR), 2023. 10

work page 2023
[15]

Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Gra- ham Cormode, Rachel Cummings, Rafael G

Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Gra- ham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, C...

work page 2021
[16]

Git- theta: A git extension for collaborative development of machine learning models

Nikhil Kandpal, Brian Lester, Mohammed Muqeeth, Anisha Mascarenhas, Monty Evans, Vishal Baskaran, Tenghao Huang, Haokun Liu, and Colin Raffel. Git- theta: A git extension for collaborative development of machine learning models. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

work page 2023
[17]

Beresford

Martin Kleppmann and Alastair R. Beresford. A conflict-free replicated JSON datatype.IEEE Transactions on Parallel and Distributed Systems, 28(10):2733–2746, 2017

work page 2017
[18]

Stich, and Martin Jaggi

Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. InProceedings of the 36th International Conference on Machine Learning (ICML), 2019

work page 2019
[19]

Time, clocks, and the ordering of events in a distributed system.Communications of the ACM, 21(7):558–565, 1978

Leslie Lamport. Time, clocks, and the ordering of events in a distributed system.Communications of the ACM, 21(7):558–565, 1978

work page 1978
[20]

Federated optimization in heterogeneous networks

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. InProceedings of Machine Learning and Systems (MLSys), 2020

work page 2020
[21]

Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gra- dient descent

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gra- dient descent. InAdvances in Neural Information Processing Systems 30 (NeurIPS), 2017

work page 2017
[22]

Matena and Colin Raffel

Michael S. Matena and Colin Raffel. Merging models with Fisher-weighted averaging. InAdvances in Neu- ral Information Processing Systems 35 (NeurIPS), 2022

work page 2022
[23]

Brendan McMahan, Eider Moore, Daniel Ra- mage, Seth Hampson, and Blaise Agüera y Arcas

H. Brendan McMahan, Eider Moore, Daniel Ra- mage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. InProceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017

work page 2017
[24]

Conflict-free replicated data types (CRDTs)

Nuno Preguiça, Carlos Baquero, and Marc Shapiro. Conflict-free replicated data types (CRDTs). In Encyclopedia of Big Data Technologies. Springer, 2018

work page 2018
[25]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019

work page 2019
[26]

Merkle-CRDTs: Merkle-DAGs meet CRDTs.arXiv preprint arXiv:2004.00107, 2020

Hector Sanjuan, Samuli Poyhtari, Pedro Teixeira, and Ioannis Psaras. Merkle-CRDTs: Merkle-DAGs meet CRDTs.arXiv preprint arXiv:2004.00107, 2020

work page arXiv 2004
[27]

Schneider

Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial.ACM Computing Surveys, 22(4):299–319, 1990

work page 1990
[28]

A comprehensive study of convergent and commutative replicated data types

Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. A comprehensive study of convergent and commutative replicated data types. Technical Report RR-7506, INRIA, 2011

work page 2011
[29]

Conflict-free replicated data types

Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. InProceedings of the 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), volume 6976 ofLecture Notes in Computer Science, pages 386–400. Springer, 2011

work page 2011
[30]

Animating rotation with quaternion curves

Ken Shoemake. Animating rotation with quaternion curves. InProceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 245–254, 1985

work page 1985
[31]

Eventually consistent.Communica- tions of the ACM, 52(1):40–44, 2009

Werner Vogels. Eventually consistent.Communica- tions of the ACM, 52(1):40–44, 2009

work page 2009
[32]

Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine- tuned models improves accuracy without increasing inference time. InProceedings of the 39th Inter- national Confer...

work page 2022
[33]

TIES-merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023

work page 2023
[34]

Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities.ACM Computing Surveys, 58(8), 2026

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities.ACM Computing Surveys, 58(8), 2026

work page 2026
[35]

Representationsurgeryformulti-taskmodelmerging

Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representationsurgeryformulti-taskmodelmerging. InProceedings of the 41st International Conference on Machine Learning (ICML), pages 56332–56356, 2024

work page 2024
[36]

AdaMerging: Adaptive model merging for multi-task learning

Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. AdaMerging: Adaptive model merging for multi-task learning. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

work page 2024
[37]

nearly associative

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super Mario: Absorbing abilities from homologous models as a free lunch. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024. 12 A Controlled Verification Results This appendix presents the full per-strategy results for Tier 1 (controlled4 ×4tensor)...

work page 2024
[38]

When N1 and N2 synchronise (in either order), both compute merge(S′ 1,S′

work page
[39]

= merge(S′ 2,S′ 1)by commutativity [29]

work page
[40]

Both nodes now have identical visible sets:{θ1,θ2}

work page
[41]

For multi-party convergence with k > 2nodes, associativity guarantees that the order of pairwise state merges does not affect the final state [28]

Both nodes call resolve(·,σ,·), sorting by hash, seeding randomness identically, and obtaining the same merged modelθ∗. For multi-party convergence with k > 2nodes, associativity guarantees that the order of pairwise state merges does not affect the final state [28]. Whether node N3 merges first withN1 or N2, the final visible set—and therefore the resolv...

work page
[42]

Gossip time grows quadratically in the number of nodes (reflecting all-pairs state exchange), while per-call merge() cost remains constant in tensor size. As noted in Section 6.5, this prototype gossip protocol is designed for validation purposes; production deployments beyond ∼50nodes would benefit from optimised dissemination protocols. 17 T able 5:Hugg...

work page

[1] [1]

Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

work page 2025

[2] [2]

Delta state replicated data types.Journal of Parallel and Distributed Computing, 111:162–173, 2018

Paulo Sérgio Almeida, Ali Shoker, and Carlos Baquero. Delta state replicated data types.Journal of Parallel and Distributed Computing, 111:162–173, 2018

work page 2018

[3] [3]

Making operation-based CRDTs operation- based

Carlos Baquero, Paulo Sérgio Almeida, and Ali Shoker. Making operation-based CRDTs operation- based. InDistributed Applications and Interoperable Systems – 14th IFIP WG 6.1 International Con- ference (DAIS), volume 8460 ofLecture Notes in Computer Science, pages 126–140. Springer, 2014

work page 2014

[4] [4]

Machine learning with adversaries: Byzantine tolerant gradient descent

Peva Blanchard, El Mahdi El Mhamdi, Rachid Guer- raoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 119–129, 2017

work page 2017

[5] [5]

Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander

Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. InProceedings of Machine Learning and Systems (MLSys), 2019

work page 2019

[6] [6]

Model breadcrumbs: Scaling multi-task model merg- ing with sparse masks

MohammadReza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merg- ing with sparse masks. InComputer Vision – ECCV 2024, volume 15133 ofLecture Notes in Computer Science, pages 270–287. Springer, 2024

work page 2024

[7] [7]

Dynamo: Amazon’s highly available key-value store

Giuseppe DeCandia, Deniz Hastorun, Madan Jam- pani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. InProceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), pages 205–220, 2007

work page 2007

[8] [8]

Della-merging: Reducing interference in model merging through magnitude-based sampling

Pala Tej Deep, Rishabh Bhardwaj, and Soujanya Poria. DELLA-merging: Reducing interference in model merging through magnitude-based sampling. arXiv preprint arXiv:2406.11617, 2024

work page arXiv 2024

[9] [9]

Method and system for conflict- free merging of neural network model parameters using convergent replicated data types

Ryan Gillespie. Method and system for conflict- free merging of neural network model parameters using convergent replicated data types. UK Patent Application No. GB2607132.4, filed 30 March 2026

work page 2026

[10] [10]

Arcee’s MergeKit: A toolkit for merging large lan- guage models

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s MergeKit: A toolkit for merging large lan- guage models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Process- ing: Industry Track (EMNLP Industry Track), 2024

work page 2024

[11] [11]

EMR-merging: Tuning-free high-performance model merging

Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. EMR-merging: Tuning-free high-performance model merging. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

work page 2024

[12] [12]

Editing models with task arithmetic

GabrielIlharco, MarcoTulioRibeiro, MitchellWorts- man, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023

[13] [13]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap- lot, Diego de Las Casas, Florian Bressand, Gi- anna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B.arXiv preprint arXi...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Dataless knowledge fusion by merging weights of language models

Xisen Jin, Xiang Ren, Daniel Preoţiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. InThe Eleventh International Conference on Learning Representa- tions (ICLR), 2023. 10

work page 2023

[15] [15]

Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Gra- ham Cormode, Rachel Cummings, Rafael G

Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Gra- ham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, C...

work page 2021

[16] [16]

Git- theta: A git extension for collaborative development of machine learning models

Nikhil Kandpal, Brian Lester, Mohammed Muqeeth, Anisha Mascarenhas, Monty Evans, Vishal Baskaran, Tenghao Huang, Haokun Liu, and Colin Raffel. Git- theta: A git extension for collaborative development of machine learning models. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

work page 2023

[17] [17]

Beresford

Martin Kleppmann and Alastair R. Beresford. A conflict-free replicated JSON datatype.IEEE Transactions on Parallel and Distributed Systems, 28(10):2733–2746, 2017

work page 2017

[18] [18]

Stich, and Martin Jaggi

Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. InProceedings of the 36th International Conference on Machine Learning (ICML), 2019

work page 2019

[19] [19]

Time, clocks, and the ordering of events in a distributed system.Communications of the ACM, 21(7):558–565, 1978

Leslie Lamport. Time, clocks, and the ordering of events in a distributed system.Communications of the ACM, 21(7):558–565, 1978

work page 1978

[20] [20]

Federated optimization in heterogeneous networks

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. InProceedings of Machine Learning and Systems (MLSys), 2020

work page 2020

[21] [21]

Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gra- dient descent

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gra- dient descent. InAdvances in Neural Information Processing Systems 30 (NeurIPS), 2017

work page 2017

[22] [22]

Matena and Colin Raffel

Michael S. Matena and Colin Raffel. Merging models with Fisher-weighted averaging. InAdvances in Neu- ral Information Processing Systems 35 (NeurIPS), 2022

work page 2022

[23] [23]

Brendan McMahan, Eider Moore, Daniel Ra- mage, Seth Hampson, and Blaise Agüera y Arcas

H. Brendan McMahan, Eider Moore, Daniel Ra- mage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. InProceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017

work page 2017

[24] [24]

Conflict-free replicated data types (CRDTs)

Nuno Preguiça, Carlos Baquero, and Marc Shapiro. Conflict-free replicated data types (CRDTs). In Encyclopedia of Big Data Technologies. Springer, 2018

work page 2018

[25] [25]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019

work page 2019

[26] [26]

Merkle-CRDTs: Merkle-DAGs meet CRDTs.arXiv preprint arXiv:2004.00107, 2020

Hector Sanjuan, Samuli Poyhtari, Pedro Teixeira, and Ioannis Psaras. Merkle-CRDTs: Merkle-DAGs meet CRDTs.arXiv preprint arXiv:2004.00107, 2020

work page arXiv 2004

[27] [27]

Schneider

Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial.ACM Computing Surveys, 22(4):299–319, 1990

work page 1990

[28] [28]

A comprehensive study of convergent and commutative replicated data types

Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. A comprehensive study of convergent and commutative replicated data types. Technical Report RR-7506, INRIA, 2011

work page 2011

[29] [29]

Conflict-free replicated data types

Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. InProceedings of the 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), volume 6976 ofLecture Notes in Computer Science, pages 386–400. Springer, 2011

work page 2011

[30] [30]

Animating rotation with quaternion curves

Ken Shoemake. Animating rotation with quaternion curves. InProceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 245–254, 1985

work page 1985

[31] [31]

Eventually consistent.Communica- tions of the ACM, 52(1):40–44, 2009

Werner Vogels. Eventually consistent.Communica- tions of the ACM, 52(1):40–44, 2009

work page 2009

[32] [32]

Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine- tuned models improves accuracy without increasing inference time. InProceedings of the 39th Inter- national Confer...

work page 2022

[33] [33]

TIES-merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023

work page 2023

[34] [34]

Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities.ACM Computing Surveys, 58(8), 2026

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities.ACM Computing Surveys, 58(8), 2026

work page 2026

[35] [35]

Representationsurgeryformulti-taskmodelmerging

Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representationsurgeryformulti-taskmodelmerging. InProceedings of the 41st International Conference on Machine Learning (ICML), pages 56332–56356, 2024

work page 2024

[36] [36]

AdaMerging: Adaptive model merging for multi-task learning

Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. AdaMerging: Adaptive model merging for multi-task learning. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

work page 2024

[37] [37]

nearly associative

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super Mario: Absorbing abilities from homologous models as a free lunch. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024. 12 A Controlled Verification Results This appendix presents the full per-strategy results for Tier 1 (controlled4 ×4tensor)...

work page 2024

[38] [38]

When N1 and N2 synchronise (in either order), both compute merge(S′ 1,S′

work page

[39] [39]

= merge(S′ 2,S′ 1)by commutativity [29]

work page

[40] [40]

Both nodes now have identical visible sets:{θ1,θ2}

work page

[41] [41]

For multi-party convergence with k > 2nodes, associativity guarantees that the order of pairwise state merges does not affect the final state [28]

Both nodes call resolve(·,σ,·), sorting by hash, seeding randomness identically, and obtaining the same merged modelθ∗. For multi-party convergence with k > 2nodes, associativity guarantees that the order of pairwise state merges does not affect the final state [28]. Whether node N3 merges first withN1 or N2, the final visible set—and therefore the resolv...

work page

[42] [42]

Gossip time grows quadratically in the number of nodes (reflecting all-pairs state exchange), while per-call merge() cost remains constant in tensor size. As noted in Section 6.5, this prototype gossip protocol is designed for validation purposes; production deployments beyond ∼50nodes would benefit from optimised dissemination protocols. 17 T able 5:Hugg...

work page