Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies
Pith reviewed 2026-05-20 15:38 UTC · model grok-4.3
The pith
A two-layer CRDT wrapper enables any neural network merge strategy to achieve strong eventual consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a two-layer architecture, CRDTMergeState, guarantees strong eventual consistency for model merging across replicas. The first layer collects contributions under OR-Set CRDT semantics, where merging is set union. The second layer executes the merge strategy as a deterministic pure function over the canonically ordered contribution set, with randomness seeded from the Merkle root.
What carries the argument
CRDTMergeState, a two-layer wrapper that uses OR-Set for collecting contributions via set union in the first layer and applies merge strategies deterministically in the second layer.
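The two layers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the crdt-merge API: the class name `MergeState`, the helpers `contribution_id` and `root_hash`, and the toy strategy `noisy_mean` are all hypothetical. Layer 1 is simplified to a grow-only set, and a flat hash over sorted ids stands in for a real Merkle root.

```python
import hashlib
import random
import struct

def contribution_id(weights):
    """Content hash of one contribution (a flat list of floats)."""
    return hashlib.sha256(struct.pack(f"{len(weights)}d", *weights)).hexdigest()

def root_hash(ids):
    """Stand-in for a Merkle root: one hash over the sorted contribution ids."""
    return hashlib.sha256("".join(sorted(ids)).encode()).hexdigest()

class MergeState:
    """Hypothetical two-layer state (assumption, not the paper's class)."""

    def __init__(self):
        self.contributions = {}  # id -> weights; grow-only simplification of Layer 1

    def add(self, weights):
        self.contributions[contribution_id(weights)] = list(weights)

    def merge(self, other):
        """Layer 1: set union of contributions, keyed by content hash."""
        merged = MergeState()
        merged.contributions = {**self.contributions, **other.contributions}
        return merged

    def resolve(self, strategy):
        """Layer 2: run the strategy on canonically ordered inputs with a derived seed."""
        ids = sorted(self.contributions)
        rng = random.Random(int(root_hash(ids), 16))
        return strategy([self.contributions[i] for i in ids], rng)

def noisy_mean(models, rng):
    """Toy strategy that is order-sensitive and stochastic on its own."""
    acc = models[0]
    for m in models[1:]:
        acc = [(x + y) / 2 for x, y in zip(acc, m)]  # running pairwise mean
    return [x + rng.uniform(-1e-3, 1e-3) for x in acc]

a, b = MergeState(), MergeState()
a.add([1.0, 2.0])
a.add([3.0, 4.0])
b.add([5.0, 6.0])

left = a.merge(b).resolve(noisy_mean)   # replica that saw a's state first
right = b.merge(a).resolve(noisy_mean)  # replica that saw b's state first
print(left == right)
```

In this sketch the strategy itself is untouched; order-independence comes entirely from sorting the ids and deriving the seed from them before the strategy runs.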
If this is right
- Replicas converge to identical merged models given the same contributions, independent of order.
- The wrapper is transparent, so the merged model's performance matches the original strategy by construction.
- Tests confirm the properties hold for models up to 7 billion parameters and under network partitions.
- Any of the 26 strategies can be used without modification to their core logic.
Where Pith is reading between the lines
- This could support decentralized model merging in collaborative AI projects without central servers.
- Similar two-layer designs might help other non-commutative operations in machine learning become order-independent.
- Examining the effect of Merkle seeding on strategies with internal randomness could be a next step.
Load-bearing premise
That any merge strategy can be wrapped as a deterministic pure function over a canonically ordered contribution set with randomness seeded from the Merkle root without altering its intended behavior or introducing new inconsistencies.
What would settle it
If replicas receiving identical contributions but in different orders produce merged models with differing parameters, the consistency proof would be invalidated.
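The falsification test described above can be approximated with a small harness that delivers the same contributions in every possible order and compares digests of the result. The sketch below is an assumption-laden toy: `running_mean` is a hypothetical order-sensitive strategy, and `wrapped` stands in for the paper's wrapper by deduplicating and sorting before merging.

```python
import hashlib
import itertools

def running_mean(models):
    """Order-sensitive toy strategy: fold pairwise means left to right."""
    acc = list(models[0])
    for model in models[1:]:
        acc = [(x + y) / 2 for x, y in zip(acc, model)]
    return acc

def wrapped(received):
    """Dedupe and sort before merging (hypothetical stand-in for the wrapper)."""
    return running_mean(sorted(set(received)))

def distinct_outputs(merge_fn, contributions):
    """Digests of the merged model over every delivery order."""
    return {
        hashlib.sha256(repr(merge_fn(list(order))).encode()).hexdigest()
        for order in itertools.permutations(contributions)
    }

contributions = [(1.0, 2.0), (3.0, 5.0), (8.0, 13.0)]
raw_digests = distinct_outputs(running_mean, contributions)
wrapped_digests = distinct_outputs(wrapped, contributions)
print(len(raw_digests), len(wrapped_digests))
```

More than one digest for the wrapped function would be the kind of evidence that invalidates the consistency claim; the raw strategy is expected to produce several.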
Original abstract
All 26 neural network merge strategies we tested -- including weight averaging, SLERP, TIES, DARE, Fisher merging, and evolutionary approaches -- fail the algebraic properties (commutativity, associativity, idempotency) required for conflict-free distributed operation. We prove that this failure is structural: normalisation-based merges cannot simultaneously satisfy all three properties. To resolve this, we present a two-layer architecture -- CRDTMergeState -- that wraps any merge strategy in a CRDT-compliant (Conflict-Free Replicated Data Type) layer. Layer 1 manages contributions via OR-Set CRDT semantics, where the merge operation is set union -- trivially commutative, associative, and idempotent. Layer 2 applies merge strategies as deterministic pure functions over a canonically ordered contribution set, with randomness seeded from the Merkle root. We prove that this separation guarantees Strong Eventual Consistency: all replicas receiving the same contributions compute identical merged models, regardless of message ordering. Empirical validation spans three tiers: controlled 4x4 tensors (104/104 tests pass), production-scale models up to 7.24B parameters (208 strategy-level tests, 43,368 layer-level property checks at capped tensor resolution), and multi-node convergence under gossip and partition healing (100 nodes, 20 orderings), with CRDT overhead below 0.5 ms. Because the wrapper is transparent, downstream performance is identical by construction, confirmed via byte-identical output verification. The reference implementation is available as crdt-merge v0.9.4.
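A quick numerical check shows how simple binary merges miss at least one of the three properties. This is an illustration, not the paper's proof: `pairwise_mean` and `pairwise_sum` are hypothetical stand-ins for a normalised and an unnormalised merge, and a `True` below reflects only this one sample, whereas a `False` is a genuine counterexample.

```python
def pairwise_mean(a, b):
    """Normalised merge: element-wise average of two weight vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def pairwise_sum(a, b):
    """Unnormalised merge: element-wise sum (task-arithmetic style)."""
    return [x + y for x, y in zip(a, b)]

def properties(op, a, b, c):
    """Evaluate the three CRDT merge properties on one sample triple."""
    return {
        "commutative": op(a, b) == op(b, a),
        "associative": op(op(a, b), c) == op(a, op(b, c)),
        "idempotent": op(a, a) == a,
    }

a, b, c = [1.0, 2.0], [3.0, 4.0], [8.0, 16.0]
mean_props = properties(pairwise_mean, a, b, c)
sum_props = properties(pairwise_sum, a, b, c)
print("mean:", mean_props)
print("sum: ", sum_props)
```

On this sample the average is idempotent but not associative, while the sum is associative but not idempotent, which matches the trade-off the abstract describes for normalisation.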
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that all 26 tested neural network merge strategies (weight averaging, SLERP, TIES, DARE, Fisher, evolutionary) fail the algebraic properties of commutativity, associativity, and idempotency required for CRDT operation. It proves this failure is structural for normalisation-based merges. To address it, the authors introduce a two-layer CRDTMergeState architecture: Layer 1 uses standard OR-Set CRDT semantics for contribution management (set union), while Layer 2 applies any merge strategy as a deterministic pure function over a canonically ordered contribution set with randomness seeded from the Merkle root. They prove this separation yields Strong Eventual Consistency (identical merged models for identical contribution sets regardless of order). Empirical results include 104/104 controlled tensor tests, 208 strategy-level and 43,368 layer-level checks on models up to 7.24B parameters, and 100-node multi-ordering convergence tests, with <0.5 ms overhead and byte-identical outputs; a reference implementation (crdt-merge v0.9.4) is provided.
Significance. If the central construction holds, the work enables arbitrary neural-network merge strategies to be used safely inside replicated distributed systems while inheriting CRDT consistency guarantees. This is a meaningful bridge between model-merging literature and distributed-systems primitives. Credit is due for the explicit structural impossibility argument, the separation that re-uses standard OR-Set properties, the scale of the empirical validation (including production-scale models and partition-healing scenarios), and the release of reproducible code that permits byte-for-byte verification.
major comments (1)
- [§3.2] §3.2 (Determinism construction): The claim that Merkle-root seeding plus canonical ordering renders any of the 26 strategies a pure deterministic function without altering intended behaviour is load-bearing for the SEC proof. The manuscript should supply a short argument or counter-example showing that this transformation preserves the semantic intent of inherently stochastic strategies (e.g., certain evolutionary or DARE variants) rather than merely producing byte-identical outputs on the tested seeds.
minor comments (4)
- [Abstract] Abstract, line 3: the phrase 'normalisation-based merges' is used before it is defined; a parenthetical gloss or forward reference to §4.1 would improve readability.
- [Table 2] Table 2 (layer-level property checks): the caption states '43,368 checks' but the column sums appear to total 43,200; a brief reconciliation note or corrected count would eliminate the discrepancy.
- [§6.3] §6.3 (multi-node experiments): the gossip and partition-healing scenarios are described at a high level; adding the exact message-delivery schedule or pseudocode for the 20 orderings would aid reproducibility.
- [References] Reference list: the CRDT foundational citations (Shapiro et al., 2011; Preguiça et al.) are present, but recent surveys on model merging (e.g., in federated or decentralised learning) are absent; adding two or three would better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the positive summary, the recognition of the work's significance as a bridge between model merging and CRDTs, and the recommendation for minor revision. We address the single major comment below.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Determinism construction): The claim that Merkle-root seeding plus canonical ordering renders any of the 26 strategies a pure deterministic function without altering intended behaviour is load-bearing for the SEC proof. The manuscript should supply a short argument or counter-example showing that this transformation preserves the semantic intent of inherently stochastic strategies (e.g., certain evolutionary or DARE variants) rather than merely producing byte-identical outputs on the tested seeds.
Authors: We agree that an explicit clarification strengthens the load-bearing claim in §3.2. In the revised manuscript we will insert a concise paragraph arguing that Merkle-root seeding plus canonical ordering produces a deterministic pure function while preserving semantic intent for stochastic strategies. The argument is as follows. Stochastic elements in strategies such as evolutionary merging or DARE variants (e.g., random perturbations, dropout masks, or tie-breaking) are intended to generate a specific merge outcome from a given input set, not to produce results that differ across replicas. Deriving the seed from the Merkle root of the canonically ordered contribution set makes the random choices a deterministic function of the input set itself. Consequently, every replica that receives the identical contribution set executes the identical sequence of stochastic operations and obtains the identical merged model, satisfying SEC. This does not alter the strategy's intended behaviour for that set; it only makes the behaviour reproducible, which is a prerequisite for any CRDT-compliant wrapper. As a concrete example, consider a DARE variant that applies random weight dropout: the Merkle-derived seed yields the same dropout mask on every replica holding the same contribution set, producing the output model that the original stochastic procedure would have produced under that fixed seed. Our existing empirical results (byte-identical outputs across 20 orderings on 100-node tests and 43,368 layer-level checks) already confirm that the transformation yields the expected merge for each contribution set. We will add this short argument and example to §3.2.
revision: yes
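The DARE example in the rebuttal can be made concrete with a small sketch. Everything here is an assumption for illustration: `seed_from` hashes the sorted ids as a stand-in for the Merkle root, the contribution ids are made up, and `drop_and_rescale` is a simplified drop-and-rescale step in the style of DARE, not the paper's implementation.

```python
import hashlib
import random

def seed_from(contribution_ids):
    """Seed derived from the set of ids, independent of arrival order."""
    canonical = "\n".join(sorted(set(contribution_ids)))
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest(), "big")

def drop_and_rescale(delta, drop_rate, rng):
    """Drop each entry with probability drop_rate and rescale the survivors."""
    keep = 1.0 - drop_rate
    return [d / keep if rng.random() >= drop_rate else 0.0 for d in delta]

delta = [0.5, -0.25, 0.125, 1.0, -2.0, 0.75]

# Two replicas see the same contributions in a different order;
# replica B also receives one of them twice.
ids_seen_by_a = ["c3", "a1", "b2"]
ids_seen_by_b = ["b2", "c3", "a1", "a1"]

out_a = drop_and_rescale(delta, 0.5, random.Random(seed_from(ids_seen_by_a)))
out_b = drop_and_rescale(delta, 0.5, random.Random(seed_from(ids_seen_by_b)))
print(out_a == out_b)
```

The sketch only demonstrates reproducibility across replicas. Whether tying the mask to the contribution set preserves the distributional behaviour the strategy was designed for is the separate question the referee raises.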
Circularity Check
No significant circularity detected
Full rationale
The derivation separates contribution management (Layer 1, using standard OR-Set CRDT union which is independently known to be commutative, associative, and idempotent) from strategy application (Layer 2, as a deterministic pure function on a canonically ordered set with Merkle-root seeding). The Strong Eventual Consistency guarantee follows directly from these external algebraic properties plus the added determinism, without any reduction of the result to a fitted parameter, self-definition, or self-citation chain. The structural failure of the 26 raw strategies is shown separately via algebraic counterexamples, and downstream equivalence is confirmed by byte-identical verification, rendering the argument self-contained against standard CRDT benchmarks.
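The OR-Set properties that the rationale relies on can be checked on a minimal state-based implementation. This is a textbook-style sketch with hypothetical names (`ORSet`, the tag strings, the element labels), not code from crdt-merge: each add carries a unique tag, a remove records the tags it has observed, and merging is a pairwise union.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ORSet:
    adds: frozenset = frozenset()     # (element, unique_tag) pairs
    removes: frozenset = frozenset()  # tags observed at removal time

    def add(self, element, tag):
        return ORSet(self.adds | {(element, tag)}, self.removes)

    def remove(self, element):
        observed = {t for (e, t) in self.adds if e == element}
        return ORSet(self.adds, self.removes | observed)

    def merge(self, other):
        """Union of both components; this is the whole Layer 1 merge."""
        return ORSet(self.adds | other.adds, self.removes | other.removes)

    def visible(self):
        return frozenset(e for (e, t) in self.adds if t not in self.removes)

x = ORSet().add("theta1", "n1:1")
y = ORSet().add("theta2", "n2:1").add("theta1", "n2:2")
z = x.remove("theta1")  # removes only the tag that x has observed

print(x.merge(y) == y.merge(x))                    # commutativity
print(x.merge(y).merge(z) == x.merge(y.merge(z)))  # associativity
print(x.merge(x) == x)                             # idempotency
print(sorted(z.merge(y).visible()))
```

Because the removal in `z` covers only the tag `n1:1`, the concurrent add of `theta1` under tag `n2:2` survives the merge, which is the add-wins behaviour usually associated with OR-Sets.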
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Merge strategies can be executed as deterministic pure functions once contributions are canonically ordered and randomness is seeded from the Merkle root.
invented entities (1)
- CRDTMergeState: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction
unclear · Relation between the paper passage and the cited Recognition theorem.
Layer 1 manages contributions via OR-Set CRDT semantics, where the merge operation is set union—trivially commutative, associative, and idempotent. Layer 2 applies merge strategies as deterministic pure functions over a canonically-ordered contribution set, with randomness seeded from the Merkle root.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery and embed_injective
unclear · Relation between the paper passage and the cited Recognition theorem.
We prove that this separation guarantees Strong Eventual Consistency: all replicas receiving the same contributions compute identical merged models, regardless of message ordering.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes. Nature Machine Intelligence, 7(2):195–204, 2025.
- [2] Paulo Sérgio Almeida, Ali Shoker, and Carlos Baquero. Delta state replicated data types. Journal of Parallel and Distributed Computing, 111:162–173, 2018.
- [3] Carlos Baquero, Paulo Sérgio Almeida, and Ali Shoker. Making operation-based CRDTs operation-based. In Distributed Applications and Interoperable Systems – 14th IFIP WG 6.1 International Conference (DAIS), volume 8460 of Lecture Notes in Computer Science, pages 126–140. Springer, 2014.
- [4] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 119–129, 2017.
- [5] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In Proceedings of Machine Learning and Systems (MLSys), 2019.
- [6] MohammadReza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merging with sparse masks. In Computer Vision – ECCV 2024, volume 15133 of Lecture Notes in Computer Science, pages 270–287. Springer, 2024.
- [7] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), pages 205–220, 2007.
- [8] Pala Tej Deep, Rishabh Bhardwaj, and Soujanya Poria. DELLA-merging: Reducing interference in model merging through magnitude-based sampling. arXiv preprint arXiv:2406.11617, 2024.
- [9] Ryan Gillespie. Method and system for conflict-free merging of neural network model parameters using convergent replicated data types. UK Patent Application No. GB2607132.4, filed 30 March 2026.
- [10] Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s MergeKit: A toolkit for merging large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track (EMNLP Industry Track), 2024.
- [11] Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. EMR-merging: Tuning-free high-performance model merging. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024.
- [12] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
- [13] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXi...
- [14] Xisen Jin, Xiang Ren, Daniel Preoţiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
- [15] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, C...
- [16] Nikhil Kandpal, Brian Lester, Mohammed Muqeeth, Anisha Mascarenhas, Monty Evans, Vishal Baskaran, Tenghao Huang, Haokun Liu, and Colin Raffel. Git-theta: A git extension for collaborative development of machine learning models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
- [17]
- [18] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
- [19] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.
- [20] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems (MLSys), 2020.
- [21] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.
- [22] Michael S. Matena and Colin Raffel. Merging models with Fisher-weighted averaging. In Advances in Neural Information Processing Systems 35 (NeurIPS), 2022.
- [23] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017.
- [24] Nuno Preguiça, Carlos Baquero, and Marc Shapiro. Conflict-free replicated data types (CRDTs). In Encyclopedia of Big Data Technologies. Springer, 2018.
- [25] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
- [26] Hector Sanjuan, Samuli Poyhtari, Pedro Teixeira, and Ioannis Psaras. Merkle-CRDTs: Merkle-DAGs meet CRDTs. arXiv preprint arXiv:2004.00107, 2020.
- [27]
- [28] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. A comprehensive study of convergent and commutative replicated data types. Technical Report RR-7506, INRIA, 2011.
- [29] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. In Proceedings of the 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), volume 6976 of Lecture Notes in Computer Science, pages 386–400. Springer, 2011.
- [30] Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 245–254, 1985.
- [31] Werner Vogels. Eventually consistent. Communications of the ACM, 52(1):40–44, 2009.
- [32] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the 39th International Confer...
- [33] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023.
- [34] Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. ACM Computing Surveys, 58(8), 2026.
- [35] Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 56332–56356, 2024.
- [36] Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. AdaMerging: Adaptive model merging for multi-task learning. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [37] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super Mario: Absorbing abilities from homologous models as a free lunch. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- A Controlled Verification Results. This appendix presents the full per-strategy results for Tier 1 (controlled 4×4 tensor)...
- When N1 and N2 synchronise (in either order), both compute merge(S′1, S′2) = merge(S′2, S′1) by commutativity [29].
- Both nodes now have identical visible sets: {θ1, θ2}.
- Both nodes call resolve(·, σ, ·), sorting by hash, seeding randomness identically, and obtaining the same merged model θ∗. For multi-party convergence with k > 2 nodes, associativity guarantees that the order of pairwise state merges does not affect the final state [28]. Whether node N3 merges first with N1 or N2, the final visible set, and therefore the resolv...
- Gossip time grows quadratically in the number of nodes (reflecting all-pairs state exchange), while per-call merge() cost remains constant in tensor size. As noted in Section 6.5, this prototype gossip protocol is designed for validation purposes; production deployments beyond ∼50 nodes would benefit from optimised dissemination protocols. Table 5: Hugg...