pith. sign in

arxiv: 2606.01717 · v1 · pith:F4HCXQAQnew · submitted 2026-06-01 · 💻 cs.LG

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

Pith reviewed 2026-06-28 15:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords instruction tuningdecentralized trainingweight merginggradient conflictPCA splittingmultimodal modelsvariance reductionmodel merging
0
0 comments X

The pith

Splitting instruction tuning data along gradient conflict axes and merging weights once outperforms joint training on multimodal benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that joint training on heterogeneous instruction mixtures creates gradient interference that hurts final performance. It develops a local quadratic theory showing that a single weight merge after independent training reduces variance in a curvature-weighted way. This theory motivates splitting the mixture into partitions aligned with the top PCA directions of dataset-level conflicts, training each partition separately with no communication between them, and reconciling via token-weighted averaging. On a 3B multimodal model across 136 tasks the method raises the average benchmark score from 54.3 to 57.0; the same recipe matches or exceeds joint training on a 7B model with 1.6 million examples from 176 sources. The result implies that decentralized alignment can be both cheaper and more effective than centralized training when conflicts are explicitly managed in parameter space.

Core claim

A local quadratic theory inside a shared flat basin shows that weight merging produces curvature-weighted variance reduction, that PCA-aligned conflict splitting maximizes this gain along high-curvature directions, and that merging further acts as spectral filtering with implicit norm regularization. These properties directly motivate a pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top conflict axes, fine-tunes each partition independently, and merges the resulting models once via token-weighted averaging.

What carries the argument

Local quadratic theory inside a shared flat basin, which predicts curvature-weighted variance reduction from merging and optimal splitting along high-curvature PCA axes of gradient conflicts.

If this is right

  • The 3B model reaches 57.0 average on the eight benchmarks instead of 54.3 from joint training.
  • The recipe scales to a 7B model on a 1.6-million-example, 176-source mixture while matching or exceeding joint-training performance.
  • Training requires no communication between partitions until the final merge step.
  • The method transfers to text-only FLAN mixtures with similar gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conflict-splitting logic could be applied to other multi-objective fine-tuning settings such as preference tuning or multi-task reinforcement learning.
  • If the flat-basin assumption holds more broadly, repeated merge-and-split cycles might allow continual addition of new data sources without full retraining.
  • The spectral-filtering effect of merging suggests that the procedure may implicitly control model capacity in ways that explicit regularization cannot easily replicate.

Load-bearing premise

The loss surfaces of models fine-tuned on different partitions lie in one shared flat basin where the local quadratic approximation correctly captures curvature and conflicts.

What would settle it

Running the full pipeline on the 136-task Vision-FLAN mixture and finding that the 8-benchmark average falls below the 54.3 joint-training baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.01717 by Geewook Kim, Minsik Choi.

Figure 1
Figure 1. Figure 1: Centralized training vs. MERIT. Centralized tun￾ing synchronizes conflicting tasks across a tightly-coupled cluster (top); MERIT partitions the mixture by conflict, fine-tunes each group independently, and merges once into ¯θ (bottom). step is particularly critical for multimodal large language models (MLLMs) (Liu et al., 2024a;b), where much of the practical multimodal behavior is acquired after language … view at source ↗
Figure 2
Figure 2. Figure 2: Local loss surfaces before and after basin preparation (e.g., LLaVA Stage 2). The merge-ready initialization θ (0) resides in a flat, connected region (right), where independently fine-tuned checkpoints remain within the same basin. Our analysis operates within this flat-basin regime, yielding three key implications that directly motivate MERIT’s algorithm design. split to make simple averaging effective u… view at source ↗
Figure 3
Figure 3. Figure 3: Post-hoc merging baselines applied to MERIT’s 7B 2D branches (K=4). Each operator replaces MERIT’s token￾weighted averaging as a drop-in merge step on the same four branches. Bars show the mean over 3 seeds on the 8-benchmark suite; circles mark per-seed scores and error bars denote ±1 std. Token-weighted averaging outperforms all four alternatives; we attribute this to MERIT’s branches being complementary… view at source ↗
Figure 4
Figure 4. Figure 4: Weight perturbation robustness. The merged model (blue) exhibits consistently lower loss sensitivity than the jointly trained model (red) at all perturbation levels and epochs, providing direct evidence of a flatter loss landscape. B.3. Displacement Contraction By convexity of the squared norm, weight averaging contracts the merged model toward the shared initialization: ∥ ¯θw − θ (0)∥ 2 ≤ P i wi∥θi − θ (0… view at source ↗
Figure 5
Figure 5. Figure 5: Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten & Hinton, 2008) of dataset-level gradients at θ (0) (Qwen2.5-VL-3B, 136 Vision-FLAN tasks), overlaid with kernel density estimation (KDE) contours. Each marker is one task. Labeled examples are colored by task type: VQA (yellow), image classification (red), and description/captioning (green). This task-type structure is wha… view at source ↗
Figure 6
Figure 6. Figure 6: PCA-based visualization of dataset groups in gradient space. The split groups extend along distinct, often conflicting directions from the shared initialization, while the merged model lies near the center, reflecting aggregation of complementary updates. The gain Gvar is exact (Theorem F.1); the two remainders are the cubic Taylor residual |∆HO| and the second-order (hence approximate) split-mismatch pena… view at source ↗
read the original abstract

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture -- matching or exceeding centralized joint training with minimal cost overhead -- and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that a local quadratic theory inside a shared flat basin yields three results (curvature-weighted variance reduction from merging, PCA-aligned splitting to maximize gain along high-curvature directions, and implicit spectral filtering), directly motivating the MERIT pipeline: dataset-level conflict estimation, PCA-based partitioning of the mixture, independent fine-tuning of partitions, and a single token-weighted merge. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks it reports an 8-benchmark average lift from 54.3 (joint training) to 57.0; the recipe is said to scale to a 7B model on a 1.6M-example 176-source mixture while matching or exceeding centralized training at low overhead, and to transfer to text-only FLAN. Code is released.

Significance. If the local quadratic regime and the reported gains hold under scrutiny, the work would offer a practical route to decentralized instruction tuning that reduces communication while improving performance on heterogeneous multimodal mixtures. The public code release is a clear strength for reproducibility.

major comments (2)
  1. [Abstract] Abstract: the local quadratic theory is asserted to produce variance reduction, PCA maximization, and spectral filtering, yet no derivation, Hessian conditioning argument, or explicit equations are supplied; without these steps it is impossible to verify that the three results follow from the stated assumptions or that they explain the observed 2.7-point gain.
  2. [Abstract] Abstract: the central empirical claim (54.3 o 57.0 on the 8-benchmark average) is presented without error bars, ablation on the number of PCA axes or conflict metric, or controls that isolate the contribution of the quadratic-motivated splitting; this leaves open whether the improvement is independent of the specific hyper-parameters chosen to fit the data.
minor comments (1)
  1. The abstract states that the recipe transfers to text-only FLAN but supplies no quantitative results or experimental details for that transfer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and empirical claims. We clarify that the theoretical derivations appear in the main text and commit to strengthening the presentation of both theory and experiments in revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the local quadratic theory is asserted to produce variance reduction, PCA maximization, and spectral filtering, yet no derivation, Hessian conditioning argument, or explicit equations are supplied; without these steps it is impossible to verify that the three results follow from the stated assumptions or that they explain the observed 2.7-point gain.

    Authors: The local quadratic theory is developed in Section 3, which derives the curvature-weighted variance reduction (Eq. 2), the PCA-aligned splitting that maximizes gain along high-curvature directions (Eq. 4), and the implicit spectral filtering with norm regularization (Eq. 6), supported by Hessian conditioning arguments under the shared flat-basin assumption. The abstract summarizes these results. We will add a parenthetical reference to Section 3 in the abstract to make the connection explicit. revision: yes

  2. Referee: [Abstract] Abstract: the central empirical claim (54.3 o 57.0 on the 8-benchmark average) is presented without error bars, ablation on the number of PCA axes or conflict metric, or controls that isolate the contribution of the quadratic-motivated splitting; this leaves open whether the improvement is independent of the specific hyper-parameters chosen to fit the data.

    Authors: The main results table in Section 4 reports averages over three random seeds (with standard deviations provided in the appendix), and the appendix already contains ablations varying the number of PCA axes and the conflict metric. To directly address the concern, we will move the error bars into the main table, add a new main-text ablation isolating the quadratic-motivated splitting from hyper-parameter choices, and include a control that replaces PCA splitting with random partitioning. revision: yes

Circularity Check

0 steps flagged

No significant circularity; theory presented as independent derivation motivating the method

full rationale

The abstract states that a local quadratic theory inside a shared flat basin is developed first and yields three results (curvature-weighted variance reduction, PCA-aligned splitting, spectral filtering) that then motivate MERIT. No equations or self-citations are provided in the given text that reduce the theory or the reported gains back to fitted parameters or prior self-work by construction. The empirical improvements (54.3 to 57.0 on Qwen2.5-VL-3B) are presented as validation of the independently derived pipeline rather than a tautological restatement of inputs. This is the common case of a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of a local quadratic approximation in a shared flat basin and on the assumption that gradient conflicts can be reliably estimated from the data partitions.

free parameters (1)
  • number of PCA conflict axes
    Chosen to capture high-curvature conflict directions; exact count is a modeling choice that affects the split.
axioms (1)
  • domain assumption Local quadratic theory inside a shared flat basin yields curvature-weighted variance reduction and PCA-aligned splitting benefit
    Invoked in the abstract to derive the three theoretical results that motivate MERIT.

pith-pipeline@v0.9.1-grok · 5762 in / 1333 out tokens · 33755 ms · 2026-06-28T15:14:44.413438+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    Gemma 3 Technical Report

    URL https://openreview.net/forum? id=DgH9YCsqWm. Gargiulo, A. A., Crisostomi, D., Bucarelli, M. S., Scar- dapane, S., Silvestri, F., and Rodola, E. Task singular vectors: Reducing task interference in model merging. In Proceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 18695–18705, 2025. Garipov, T., Izmailov, P., Podoprikhin, D....

  2. [2]

    emnlp-main.944/

    URL https://aclanthology.org/2024. emnlp-main.944/. Laurenc ¸on, H., Tronchon, L., Cord, M., and Sanh, V . What matters when building vision-language models?Ad- vances in Neural Information Processing Systems, 37: 87874–87907, 2024. Lee, S., Kim, G., Kim, J., Lee, H., Chang, H., Park, S. H., and Seo, M. How Does Vision-Language Adaptation Impact the Safet...

  3. [3]

    Li, Q., He, B., and Song, D

    URL https://openreview.net/forum? id=zKv8qULV6n. Li, Q., He, B., and Song, D. Practical One-Shot Feder- ated Learning for Cross-Silo Setting. In Zhou, Z.-H. (ed.),Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 1484–

  4. [4]

    doi: 10.24963/ijcai.2021/

    International Joint Conferences on Artificial Intel- ligence Organization, 8 2021. doi: 10.24963/ijcai.2021/

  5. [5]

    2021/205

    URL https://doi.org/10.24963/ijcai. 2021/205. Main Track. Li, Q., Xie, C., Xu, X., Liu, X., Zhang, C., Li, B., He, B., and Song, D. Effective and efficient federated tree learning on hybrid data. InInternational Conference on Learning Representations, 2024c. URL https:// openreview.net/forum?id=py4ZV2qYQI. Liu, B., Liu, X., Jin, X., Stone, P., and Liu, Q....

  6. [6]

    Marczak, D., Magistri, S., Cygert, S., Twardowski, B., Bag- danov, A

    URL https://openreview.net/forum? id=qMUbhGUFUb. Marczak, D., Magistri, S., Cygert, S., Twardowski, B., Bag- danov, A. D., and van de Weijer, J. No task left be- hind: Isotropic model merging with common and task- specific subspaces. InInternational Conference on Ma- chine Learning. PMLR, 2025. Matena, M. S. and Raffel, C. A. Merging models with fisher- w...

  7. [7]

    Qin, Y ., Qian, C., Yi, J., Chen, W., Lin, Y ., Han, X., Liu, Z., Sun, M., and Zhou, J

    URL https://openreview.net/forum? id=7EhS3YBxjY. Qin, Y ., Qian, C., Yi, J., Chen, W., Lin, Y ., Han, X., Liu, Z., Sun, M., and Zhou, J. Exploring mode connectiv- ity for pre-trained language models. InProceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing, pp. 6726–6746, Abu Dhabi, United Arab Emirates, 2022. Association...

  8. [8]

    Qwen2.5 Technical Report

    URL https://aclanthology.org/2022. emnlp-main.451/. Qin, Z., Deng, S., Zhao, M., and Yan, X. FedAPEN: Per- sonalized cross-silo federated learning with adaptability to statistical heterogeneity. InProceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, pp. 1954–1964, 2023. Qin, Z., Chen, D., Qian, B., Ding, B., Li, Y ., and D...

  9. [9]

    Xu, M., Cai, D., Wu, Y ., Li, X., and Wang, S.{FwdLLM}: Efficient federated finetuning of large language models with perturbed inferences

    URL https://proceedings.mlr.press/ v162/wortsman22a.html. Xu, M., Cai, D., Wu, Y ., Li, X., and Wang, S.{FwdLLM}: Efficient federated finetuning of large language models with perturbed inferences. In2024 USENIX Annual Tech- nical Conference (USENIX ATC 24), pp. 579–596, 2024a. Xu, Z., Feng, C., Shao, R., Ashby, T., Shen, Y ., Jin, D., Cheng, Y ., Wang, Q....

  10. [10]

    Joints+Contacts

    URL https://aclanthology.org/2024. findings-acl.905/. Yadav, P., Tam, D., Choshen, L., Raffel, C. A., and Bansal, M. TIES-Merging: Resolving interference when merging models.Advances in Neural Information Processing Systems, 36:7093–7115, 2023. Yu, L., Yu, B., Yu, H., Huang, F., and Li, Y . Language models are super mario: Absorbing abilities from homol- ...

  11. [11]

    For each benchmark, we report the standard metric defined by the dataset

    for general multimodal reasoning and compositional generalization; LLaV A-Wild (Liu et al., 2023) and MMVet (Yu et al., 2024c) for open-ended and fine-grained multimodal understanding; TextVQA (Singh et al., 2019) for text-centric visual understanding; AI2D (Kembhavi et al., 2016) for diagram-based reasoning; MathVista (Lu et al., 2024) for mathe- 15 Dece...

  12. [12]

    Rank-1 conflicts:if M≈µ 1v1v⊤ 1 and v1 is maximally spread (|v1,t|= 1/ √ T ), then GPCA var /E[Grand var ]≥T−1 : PCA captures all conflict structure while random wastes a(1−1/T)fraction

  13. [13]

    1When T is even and v1 has no zero entries, sign(v1) is balanced w.h.p

    Isotropic conflicts:if µ1 =· · ·=µ T−1 = tr(M)/(T−1) , then (10) is tight and no strategy beats random—when every direction matters equally, there is no preferred axis to split along. 1When T is even and v1 has no zero entries, sign(v1) is balanced w.h.p. for generic v1; otherwise we assume balance or reassign at most one dataset, in which case (9) holds ...