
arxiv: 2605.20803 · v1 · pith:KK5M6N4Enew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning

Pith reviewed 2026-05-21 06:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: continual learning · model merging · preference vector · task performance control · MAGMAX · catastrophic forgetting · large pre-trained models

The pith

Tunable MAGMAX lets a preference vector adjust how much each task influences a merged continual learning model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to move beyond average-performance merging in continual learning by giving users direct control over task emphasis in the final model. It does this by adding a preference vector that decides how many elements from each task-specific parameter set enter the merged weights. The vector is built automatically from a small sample of target-environment data plus the original training sets, so no hand-tuning is required. If the approach holds, the same set of task-specific parameters could be re-merged for environments that care about different tasks, without retraining or forgetting.

Core claim

Tunable MAGMAX introduces a preference vector that controls the number of elements selected from each task vector during model merging, allowing the merged model performance to be adjusted according to deployment needs. A method is also given for automatically constructing appropriate preference vectors by leveraging small amounts of target environment data and datasets from model training tasks, thereby eliminating the need for manual specification.

What carries the argument

The preference vector, which specifies how many elements to draw from each task-specific parameter vector when forming the merged model.
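
To make the mechanism concrete, here is a minimal sketch, assuming a MAGMAX-style merge in which each parameter position takes its value from exactly one task vector and a preference vector (called `quota` here) caps how many positions each task may claim. The greedy assignment rule, the function name `merge_with_quota`, and the toy vectors are illustrative assumptions; the paper's actual selection procedure is not spelled out in this summary.

```python
from typing import List, Optional


def merge_with_quota(task_vectors: List[List[float]], quota: List[int]) -> List[float]:
    """Pick one task per parameter position, largest magnitudes first.

    Hypothetical rule (not taken from the paper): visit (task, position)
    pairs in order of decreasing |value| and give a position to a task
    only while that task still has quota left.
    """
    dim = len(task_vectors[0])
    if sum(quota) < dim:
        raise ValueError("quotas must cover every parameter position")
    candidates = sorted(
        ((abs(vec[i]), t, i) for t, vec in enumerate(task_vectors) for i in range(dim)),
        key=lambda c: -c[0],
    )
    remaining = list(quota)
    merged: List[Optional[float]] = [None] * dim
    for _, t, i in candidates:
        if merged[i] is None and remaining[t] > 0:
            merged[i] = task_vectors[t][i]
            remaining[t] -= 1
    return merged  # every slot is filled because sum(quota) >= dim


# Toy task vectors (fine-tuned weights minus pre-trained weights), made up for illustration.
toy_vectors = [
    [0.9, -0.1, 0.2, 0.05],  # task 0
    [0.3, 0.8, -0.6, 0.4],   # task 1
]
unconstrained = merge_with_quota(toy_vectors, quota=[4, 4])  # plain max-magnitude choice
favour_task_0 = merge_with_quota(toy_vectors, quota=[3, 1])  # task 1 may claim one slot
```

With generous quotas the sketch reduces to element-wise maximum-magnitude selection; tightening one task's quota forces the remaining positions to fall back to the other task's values, which is the kind of task-wise shift the preference vector is meant to provide.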

If this is right

  • Task-wise performance in the merged model can be shifted toward any chosen subset of tasks to match a target environment.
  • Manual specification of merging weights is replaced by an automatic procedure that uses only a small target sample.
  • The same base merged model can be adapted to multiple deployment settings while remaining competitive with standard baselines on continual learning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preference mechanism could be layered on top of other merging algorithms that currently fix a single global weighting.
  • If the small target sample is collected periodically, the method might support online re-tuning as user priorities drift.
  • Preference vectors for different users could be stored and swapped at inference time to personalize a single deployed model.

Load-bearing premise

Small amounts of target environment data combined with training-task datasets are sufficient to automatically construct preference vectors that reliably produce the desired task-performance trade-offs without manual tuning or overfitting to the small target sample.
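
One way such a construction could look, as a rough sketch only: count how many target samples belong to each training task and allocate the d parameter slots in proportion. The label-frequency rule, the smoothing constant, and the names `preference_from_labels` and `class_to_task` are assumptions for illustration, not the paper's procedure.

```python
from collections import Counter
from typing import Dict, List


def preference_from_labels(
    target_labels: List[int],
    class_to_task: Dict[int, int],
    num_tasks: int,
    dim: int,
    smoothing: float = 1.0,
) -> List[int]:
    """Allocate `dim` parameter slots across tasks by target label frequency.

    Hypothetical rule: each task's share is proportional to the number of
    target samples whose class belongs to that task, plus a smoothing term
    so that unseen tasks still receive some slots.
    """
    counts = Counter(class_to_task[label] for label in target_labels)
    weights = [counts.get(t, 0) + smoothing for t in range(num_tasks)]
    total = sum(weights)
    shares = [dim * w / total for w in weights]
    quota = [int(s) for s in shares]  # floor; shares are non-negative
    # Largest-remainder rounding so the quotas sum exactly to dim.
    leftovers = dim - sum(quota)
    by_remainder = sorted(
        range(num_tasks), key=lambda t: shares[t] - quota[t], reverse=True
    )
    for t in by_remainder[:leftovers]:
        quota[t] += 1
    return quota


# Made-up target sample: five images from task 0's classes, one from task 1's.
example_quota = preference_from_labels(
    target_labels=[0, 1, 0, 5, 0, 1],
    class_to_task={0: 0, 1: 0, 5: 1},
    num_tasks=3,
    dim=100,
)
```

The smoothing term is one simple guard against the overfitting risk named above: with very few target samples, a task that happens to be absent from the sample is not starved of parameters entirely.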

What would settle it

If automatically constructed preference vectors produce task-performance trade-offs that deviate substantially from manually chosen vectors when both are evaluated on held-out target-environment data, the automatic-construction claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.20803 by Kei Hiroshima, Kento Uchida, Shinichi Shirakawa.

Figure 1: Our method constructs the merged model tailored to the target envi…

Figure 2: Preference vector for CIFAR-100-5 with α = 0.5, 2.0 (and d ≈ 86 × 10^6).
Accompanying text (Section 4.2, Evaluation of Tunable MAGMAX with Exemplary Preference Vectors): In this section, we evaluate the flexibility of the task-wise performance of the merged model constructed by Tunable MAGMAX. As an exemplary setting of the preference vector, we introduce a coefficient α ≥ 0 and define the t-th element of the preference vector n_T as n_t…

Figure 3: Top-1 accuracy for the first and last tasks on CIFAR-100-5 (left), -20…

Figure 4: Average accuracy with different numbers of tasks in target environments in CIFAR-100-20.
Accompanying text: The results show that Tunable MAGMAX (Label) consistently achieves high performance across varying numbers of tasks in the target environment, especially when the target environment involves a small number of tasks. Tunable MAGMAX (OT) also shows high accuracy when M is small, while its performance deteriorates when …
Original abstract

Continual learning (CL) aims to train models sequentially on multiple tasks while mitigating catastrophic forgetting of previously learned knowledge. Recent advances in large pre-trained models (LPMs) and model merging techniques, such as MAGMAX, have demonstrated effective CL performance by combining task-specific parameters. However, existing methods primarily focus on average performance across all tasks and do not adequately address how to construct models accommodating different deployment environments or varying user preferences. This paper proposes a model merging framework, termed Tunable MAGMAX, which enables preference-aware control of task-specific performance in CL. Our method introduces a preference vector that controls the number of elements selected from each task vector during model merging, allowing us to adjust the merged model performance according to their deployment needs. We further propose a method for automatically constructing appropriate preference vectors by leveraging small amounts of target environment data and datasets from model training tasks, thereby eliminating the need for manual specification. The experimental result on CL benchmark tasks demonstrates that Tunable MAGMAX effectively controls task-wise performance and successfully adapts merged models to various target environments. The proposed Tunable MAGMAX achieves superior or comparable performance to baseline methods, making it a practical solution for deploying CL models to various environments where the preferences of each task performance differ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Tunable MAGMAX, an extension of the MAGMAX model merging technique for continual learning. It introduces a tunable preference vector that controls the selection of elements from each task-specific vector during merging, enabling adjustment of task-wise performance to match different deployment environments or user preferences. The method further includes an automatic procedure to construct suitable preference vectors from small amounts of target-environment data together with the original training-task datasets, removing the need for manual specification. Experiments on standard continual learning benchmarks are reported to show effective task-wise control, successful adaptation to varied target environments, and performance that is superior or comparable to existing baselines.

Significance. If the central claims hold after addressing the points below, the work would offer a practical advance for deploying merged continual-learning models in heterogeneous environments where average-performance merging is insufficient. The automatic preference-vector construction from limited target data is a notable strength for usability, provided it proves robust; this directly tackles the gap left by prior merging methods that optimize only for aggregate metrics.

major comments (2)
  1. [§3.3] §3.3 (Preference Vector Construction): The automatic construction of the preference vector from small target-environment samples plus training-task data is presented without explicit regularization, hold-out validation, or sample-size ablation. Because the central claim of reliable tunability and adaptation rests on this step producing generalizable trade-offs at deployment time, the absence of these safeguards leaves open the risk that the reported control is an artifact of fitting to the small target sample.
  2. [Table 4] Table 4 (target-environment rows): Performance deltas for different preference settings are shown without error bars, multiple random seeds, or statistical significance tests. This weakens the claim that Tunable MAGMAX “effectively controls task-wise performance” across environments, as it is impossible to judge whether the observed trade-offs are stable or sensitive to the particular small target samples used.
minor comments (2)
  1. [Abstract] The abstract states “the experimental result” (singular); rephrasing to “experimental results” would improve readability.
  2. [§4] Notation for the preference vector (denoted p or similar) is introduced in §3 but its precise range and normalization are not restated in the experimental section; a brief reminder would aid readers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, outlining revisions that will strengthen the manuscript while preserving the core contributions.

  1. Referee: [§3.3] §3.3 (Preference Vector Construction): The automatic construction of the preference vector from small target-environment samples plus training-task data is presented without explicit regularization, hold-out validation, or sample-size ablation. Because the central claim of reliable tunability and adaptation rests on this step producing generalizable trade-offs at deployment time, the absence of these safeguards leaves open the risk that the reported control is an artifact of fitting to the small target sample.

    Authors: We agree that the absence of these elements in §3.3 represents a gap that could undermine confidence in generalizability. In the revised manuscript we will add a sample-size ablation (reporting results for 1, 5, 10, and 20 target samples per task) and introduce a hold-out validation split from the target-environment data to select the preference vector. We will also incorporate and discuss L2 regularization on the preference vector during optimization to mitigate overfitting. These changes will be presented in an expanded §3.3 and a new supplementary section. revision: yes

  2. Referee: [Table 4] Table 4 (target-environment rows): Performance deltas for different preference settings are shown without error bars, multiple random seeds, or statistical significance tests. This weakens the claim that Tunable MAGMAX “effectively controls task-wise performance” across environments, as it is impossible to judge whether the observed trade-offs are stable or sensitive to the particular small target samples used.

    Authors: We concur that reporting variability and significance is necessary to support the stability claims. In the revision we will rerun the Table 4 experiments with five independent random seeds for both preference-vector construction and evaluation. Mean and standard-deviation values will be added as error bars, and we will include paired statistical tests (e.g., t-tests) against the baseline rows to quantify significance of the observed task-wise trade-offs. Updated tables and text will appear in the main body and appendix. revision: yes
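
The reporting promised in the second response could be computed along these lines. This is a minimal sketch using only the Python standard library; the function names and the accuracy values are placeholders invented for illustration, not results from the paper.

```python
import math
import statistics
from typing import List, Tuple


def summarize(acc: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (for error bars)."""
    return statistics.mean(acc), statistics.stdev(acc)


def paired_t_statistic(method: List[float], baseline: List[float]) -> float:
    """Paired t statistic over per-seed differences (n - 1 degrees of freedom).

    Assumes both lists use the same seeds in the same order and that the
    differences are not all identical (otherwise the deviation is zero).
    """
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Placeholder per-seed accuracies for one target-environment row (five seeds).
tunable = [0.80, 0.82, 0.81, 0.79, 0.83]
baseline = [0.78, 0.79, 0.80, 0.78, 0.80]

mean_acc, std_acc = summarize(tunable)
t_stat = paired_t_statistic(tunable, baseline)
```

With five seeds the statistic has four degrees of freedom, so a two-sided test at the 5% level would compare its absolute value against roughly 2.776.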

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces a preference vector to control task-specific performance during MAGMAX-style merging and describes an automatic construction procedure that uses small target-environment samples plus training-task data. No equations, self-citations, or uniqueness theorems are quoted that reduce the central performance-control claim to a tautological redefinition or to a fitted parameter on the identical evaluation data. The reported results are framed as empirical outcomes on standard CL benchmarks, leaving the derivation self-contained against external validation rather than internally forced by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The method rests on two unstated premises: that task vectors remain sufficiently orthogonal or additive after element-wise selection, and that the small target dataset is representative enough to choose the preference vector without introducing bias.

free parameters (1)
  • preference vector
    Determines how many elements are taken from each task vector; its values are either chosen or automatically derived from limited target data.

pith-pipeline@v0.9.0 · 5753 in / 1109 out tokens · 38847 ms · 2026-05-21T06:06:35.536577+00:00 · methodology

