Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging
Pith reviewed 2026-06-28 15:14 UTC · model grok-4.3
The pith
Splitting instruction tuning data along gradient conflict axes and merging weights once outperforms joint training on multimodal benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A local quadratic theory inside a shared flat basin shows that weight merging produces curvature-weighted variance reduction, that PCA-aligned conflict splitting maximizes this gain along high-curvature directions, and that merging further acts as spectral filtering with implicit norm regularization. These properties directly motivate a pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top conflict axes, fine-tunes each partition independently, and merges the resulting models once via token-weighted averaging.
What carries the argument
Local quadratic theory inside a shared flat basin, which predicts curvature-weighted variance reduction from merging and optimal splitting along high-curvature PCA axes of gradient conflicts.
If this is right
- The 3B model reaches 57.0 average on the eight benchmarks instead of 54.3 from joint training.
- The recipe scales to a 7B model on a 1.6-million-example, 176-source mixture while matching or exceeding joint-training performance.
- Training requires no communication between partitions until the final merge step.
- The method transfers to text-only FLAN mixtures with similar gains.
Where Pith is reading between the lines
- The same conflict-splitting logic could be applied to other multi-objective fine-tuning settings such as preference tuning or multi-task reinforcement learning.
- If the flat-basin assumption holds more broadly, repeated merge-and-split cycles might allow continual addition of new data sources without full retraining.
- The spectral-filtering effect of merging suggests that the procedure may implicitly control model capacity in ways that explicit regularization cannot easily replicate.
Load-bearing premise
The loss surfaces of models fine-tuned on different partitions lie in one shared flat basin where the local quadratic approximation correctly captures curvature and conflicts.
What would settle it
Running the full pipeline on the 136-task Vision-FLAN mixture and finding that the 8-benchmark average falls below the 54.3 joint-training baseline would falsify the central claim.
Figures
read the original abstract
Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture -- matching or exceeding centralized joint training with minimal cost overhead -- and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a local quadratic theory inside a shared flat basin yields three results (curvature-weighted variance reduction from merging, PCA-aligned splitting to maximize gain along high-curvature directions, and implicit spectral filtering), directly motivating the MERIT pipeline: dataset-level conflict estimation, PCA-based partitioning of the mixture, independent fine-tuning of partitions, and a single token-weighted merge. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks it reports an 8-benchmark average lift from 54.3 (joint training) to 57.0; the recipe is said to scale to a 7B model on a 1.6M-example 176-source mixture while matching or exceeding centralized training at low overhead, and to transfer to text-only FLAN. Code is released.
Significance. If the local quadratic regime and the reported gains hold under scrutiny, the work would offer a practical route to decentralized instruction tuning that reduces communication while improving performance on heterogeneous multimodal mixtures. The public code release is a clear strength for reproducibility.
major comments (2)
- [Abstract] Abstract: the local quadratic theory is asserted to produce variance reduction, PCA maximization, and spectral filtering, yet no derivation, Hessian conditioning argument, or explicit equations are supplied; without these steps it is impossible to verify that the three results follow from the stated assumptions or that they explain the observed 2.7-point gain.
- [Abstract] Abstract: the central empirical claim (54.3 o 57.0 on the 8-benchmark average) is presented without error bars, ablation on the number of PCA axes or conflict metric, or controls that isolate the contribution of the quadratic-motivated splitting; this leaves open whether the improvement is independent of the specific hyper-parameters chosen to fit the data.
minor comments (1)
- The abstract states that the recipe transfers to text-only FLAN but supplies no quantitative results or experimental details for that transfer.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and empirical claims. We clarify that the theoretical derivations appear in the main text and commit to strengthening the presentation of both theory and experiments in revision.
read point-by-point responses
-
Referee: [Abstract] Abstract: the local quadratic theory is asserted to produce variance reduction, PCA maximization, and spectral filtering, yet no derivation, Hessian conditioning argument, or explicit equations are supplied; without these steps it is impossible to verify that the three results follow from the stated assumptions or that they explain the observed 2.7-point gain.
Authors: The local quadratic theory is developed in Section 3, which derives the curvature-weighted variance reduction (Eq. 2), the PCA-aligned splitting that maximizes gain along high-curvature directions (Eq. 4), and the implicit spectral filtering with norm regularization (Eq. 6), supported by Hessian conditioning arguments under the shared flat-basin assumption. The abstract summarizes these results. We will add a parenthetical reference to Section 3 in the abstract to make the connection explicit. revision: yes
-
Referee: [Abstract] Abstract: the central empirical claim (54.3 o 57.0 on the 8-benchmark average) is presented without error bars, ablation on the number of PCA axes or conflict metric, or controls that isolate the contribution of the quadratic-motivated splitting; this leaves open whether the improvement is independent of the specific hyper-parameters chosen to fit the data.
Authors: The main results table in Section 4 reports averages over three random seeds (with standard deviations provided in the appendix), and the appendix already contains ablations varying the number of PCA axes and the conflict metric. To directly address the concern, we will move the error bars into the main table, add a new main-text ablation isolating the quadratic-motivated splitting from hyper-parameter choices, and include a control that replaces PCA splitting with random partitioning. revision: yes
Circularity Check
No significant circularity; theory presented as independent derivation motivating the method
full rationale
The abstract states that a local quadratic theory inside a shared flat basin is developed first and yields three results (curvature-weighted variance reduction, PCA-aligned splitting, spectral filtering) that then motivate MERIT. No equations or self-citations are provided in the given text that reduce the theory or the reported gains back to fitted parameters or prior self-work by construction. The empirical improvements (54.3 to 57.0 on Qwen2.5-VL-3B) are presented as validation of the independently derived pipeline rather than a tautological restatement of inputs. This is the common case of a self-contained derivation against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of PCA conflict axes
axioms (1)
- domain assumption Local quadratic theory inside a shared flat basin yields curvature-weighted variance reduction and PCA-aligned splitting benefit
Reference graph
Works this paper leans on
-
[1]
URL https://openreview.net/forum? id=DgH9YCsqWm. Gargiulo, A. A., Crisostomi, D., Bucarelli, M. S., Scar- dapane, S., Silvestri, F., and Rodola, E. Task singular vectors: Reducing task interference in model merging. In Proceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 18695–18705, 2025. Garipov, T., Izmailov, P., Podoprikhin, D....
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/neco.1997.9.1.1 2025
-
[2]
URL https://aclanthology.org/2024. emnlp-main.944/. Laurenc ¸on, H., Tronchon, L., Cord, M., and Sanh, V . What matters when building vision-language models?Ad- vances in Neural Information Processing Systems, 37: 87874–87907, 2024. Lee, S., Kim, G., Kim, J., Lee, H., Chang, H., Park, S. H., and Seo, M. How Does Vision-Language Adaptation Impact the Safet...
arXiv 2024
-
[3]
Li, Q., He, B., and Song, D
URL https://openreview.net/forum? id=zKv8qULV6n. Li, Q., He, B., and Song, D. Practical One-Shot Feder- ated Learning for Cross-Silo Setting. In Zhou, Z.-H. (ed.),Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 1484–
-
[4]
International Joint Conferences on Artificial Intel- ligence Organization, 8 2021. doi: 10.24963/ijcai.2021/
-
[5]
URL https://doi.org/10.24963/ijcai. 2021/205. Main Track. Li, Q., Xie, C., Xu, X., Liu, X., Zhang, C., Li, B., He, B., and Song, D. Effective and efficient federated tree learning on hybrid data. InInternational Conference on Learning Representations, 2024c. URL https:// openreview.net/forum?id=py4ZV2qYQI. Liu, B., Liu, X., Jin, X., Stone, P., and Liu, Q....
-
[6]
Marczak, D., Magistri, S., Cygert, S., Twardowski, B., Bag- danov, A
URL https://openreview.net/forum? id=qMUbhGUFUb. Marczak, D., Magistri, S., Cygert, S., Twardowski, B., Bag- danov, A. D., and van de Weijer, J. No task left be- hind: Isotropic model merging with common and task- specific subspaces. InInternational Conference on Ma- chine Learning. PMLR, 2025. Matena, M. S. and Raffel, C. A. Merging models with fisher- w...
arXiv 2025
-
[7]
Qin, Y ., Qian, C., Yi, J., Chen, W., Lin, Y ., Han, X., Liu, Z., Sun, M., and Zhou, J
URL https://openreview.net/forum? id=7EhS3YBxjY. Qin, Y ., Qian, C., Yi, J., Chen, W., Lin, Y ., Han, X., Liu, Z., Sun, M., and Zhou, J. Exploring mode connectiv- ity for pre-trained language models. InProceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing, pp. 6726–6746, Abu Dhabi, United Arab Emirates, 2022. Association...
-
[8]
URL https://aclanthology.org/2022. emnlp-main.451/. Qin, Z., Deng, S., Zhao, M., and Yan, X. FedAPEN: Per- sonalized cross-silo federated learning with adaptability to statistical heterogeneity. InProceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, pp. 1954–1964, 2023. Qin, Z., Chen, D., Qian, B., Ding, B., Li, Y ., and D...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v34i05.6399 2022
-
[9]
URL https://proceedings.mlr.press/ v162/wortsman22a.html. Xu, M., Cai, D., Wu, Y ., Li, X., and Wang, S.{FwdLLM}: Efficient federated finetuning of large language models with perturbed inferences. In2024 USENIX Annual Tech- nical Conference (USENIX ATC 24), pp. 579–596, 2024a. Xu, Z., Feng, C., Shao, R., Ashby, T., Shen, Y ., Jin, D., Cheng, Y ., Wang, Q....
-
[10]
URL https://aclanthology.org/2024. findings-acl.905/. Yadav, P., Tam, D., Choshen, L., Raffel, C. A., and Bansal, M. TIES-Merging: Resolving interference when merging models.Advances in Neural Information Processing Systems, 36:7093–7115, 2023. Yu, L., Yu, B., Yu, H., Huang, F., and Li, Y . Language models are super mario: Absorbing abilities from homol- ...
-
[11]
For each benchmark, we report the standard metric defined by the dataset
for general multimodal reasoning and compositional generalization; LLaV A-Wild (Liu et al., 2023) and MMVet (Yu et al., 2024c) for open-ended and fine-grained multimodal understanding; TextVQA (Singh et al., 2019) for text-centric visual understanding; AI2D (Kembhavi et al., 2016) for diagram-based reasoning; MathVista (Lu et al., 2024) for mathe- 15 Dece...
2023
-
[12]
Rank-1 conflicts:if M≈µ 1v1v⊤ 1 and v1 is maximally spread (|v1,t|= 1/ √ T ), then GPCA var /E[Grand var ]≥T−1 : PCA captures all conflict structure while random wastes a(1−1/T)fraction
-
[13]
1When T is even and v1 has no zero entries, sign(v1) is balanced w.h.p
Isotropic conflicts:if µ1 =· · ·=µ T−1 = tr(M)/(T−1) , then (10) is tight and no strategy beats random—when every direction matters equally, there is no preferred axis to split along. 1When T is even and v1 has no zero entries, sign(v1) is balanced w.h.p. for generic v1; otherwise we assume balance or reassign at most one dataset, in which case (9) holds ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.