pith. machine review for the scientific record.

arxiv: 2604.12935 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Task Alignment: A simple and effective proxy for model merging in computer vision

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords model merging · task alignment · computer vision · hyperparameter selection · multi-task learning · decoder training · vision models

The pith

A task alignment proxy enables efficient hyperparameter selection for merging vision models with trainable decoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Merging models fine-tuned on different tasks from a shared pretrained base is useful but becomes impractical when each candidate merge requires training a decoder before evaluation. The paper introduces task alignment as a lightweight proxy that measures compatibility between the tasks of the fine-tuned models and correlates with final merged performance. This proxy replaces most of the expensive decoder training and downstream testing during hyperparameter search. The result is that model merging can be applied to a wider set of computer vision problems that use heterogeneous trainable decoders rather than being limited to frozen-head CLIP classification.

Core claim

The central claim is that task alignment scores between pairs of fine-tuned models serve as a reliable and cheap substitute for full downstream evaluation when choosing merge hyperparameters such as coefficients or layer-wise weights. Because decoder training dominates the cost in non-CLIP settings, replacing most evaluations with the proxy reduces search time by orders of magnitude while producing merged models whose accuracy after decoder training stays close to the accuracy obtained by exhaustive search.
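To make the cost structure concrete, here is a minimal sketch of proxy-guided selection over a grid of merge coefficients, assuming a plain task-arithmetic merge. All names are illustrative rather than the authors' implementation; score_fn stands in for the paper's task alignment proxy, and only the single selected candidate ever pays for decoder training.

    from typing import Callable, Dict, List

    import numpy as np

    Params = Dict[str, np.ndarray]  # model weights keyed by parameter name


    def merge(base: Params, task_vectors: List[Params], lam: float) -> Params:
        """Task-arithmetic merge: base weights plus the scaled sum of task vectors."""
        return {k: base[k] + lam * sum(tv[k] for tv in task_vectors) for k in base}


    def proxy_select(base: Params, task_vectors: List[Params],
                     candidates: List[float],
                     score_fn: Callable[[Params], float]) -> float:
        """Return the merge coefficient with the best proxy score.

        Exhaustive search would train and evaluate a decoder for every
        candidate; with the proxy, a decoder is trained once, for the winner,
        after this call returns. score_fn is a hypothetical placeholder.
        """
        return max(candidates, key=lambda lam: score_fn(merge(base, task_vectors, lam)))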

What carries the argument

The task alignment proxy, a scalar measure of task compatibility computed directly from the fine-tuned models that predicts which merges will succeed after decoder training.
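The paper defines the proxy precisely; purely as a reading aid, here is a minimal sketch in the spirit of the intuition quoted under Figure 3 (a good merge keeps the merged encoder's features close to each fine-tuned encoder's features on that encoder's task). The function name and the choice of cosine similarity are assumptions rather than the authors' code; Figure 5 suggests the score is robust to the distance metric. A score like this could serve as the score_fn in the earlier selection sketch.

    import torch
    import torch.nn.functional as F


    @torch.no_grad()
    def alignment_score(merged_encoder: torch.nn.Module,
                        finetuned_encoders: list[torch.nn.Module],
                        task_batches: list[torch.Tensor]) -> float:
        """Mean cosine similarity between merged and task-specific features.

        task_batches[t] is a small batch of images from task t; per Figure 5,
        very few images per task already yield a stable ranking.
        """
        sims = []
        for encoder, images in zip(finetuned_encoders, task_batches):
            feats_task = encoder(images)           # fine-tuned encoder θ_t
            feats_merged = merged_encoder(images)  # merged candidate θ_merged
            sims.append(F.cosine_similarity(feats_task, feats_merged, dim=-1).mean())
        return torch.stack(sims).mean().item()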

If this is right

  • Hyperparameter search for model merging no longer requires training and evaluating a decoder for every candidate set of merge coefficients.
  • Model merging becomes feasible for multi-task vision pipelines that rely on custom, trainable decoders rather than frozen classification heads.
  • The performance gap between proxy-guided selection and exhaustive downstream selection remains small enough to be acceptable in practice.
  • Merging techniques can be applied to broader families of vision models beyond standard CLIP-based image classification setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy idea could be tested on sequential merging of more than two models by ranking candidate merges according to pairwise alignment (a sketch follows this list).
  • If the correlation holds, the proxy might reduce the barrier to merging models that were fine-tuned with different data augmentations or optimization schedules.
  • Practitioners could combine the proxy with existing layer-wise or task-vector merging methods to further cut the remaining decoder training cost.
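A sketch of the sequential idea from the first bullet above: greedily fold models together, always merging in the candidate whose pairwise merge scores highest under the proxy. This is an editorial illustration under stated assumptions; pairwise_merge and proxy are hypothetical callables, and nothing like this is evaluated in the paper.

    def greedy_sequential_merge(models, pairwise_merge, proxy):
        """Greedy sequential merging ranked by proxy score (hypothetical).

        pairwise_merge(a, b) returns a merged model; proxy(m) returns its
        alignment score. Both are placeholders, not the paper's API.
        """
        pool = list(models)
        current = pool.pop(0)  # arbitrary starting model
        while pool:
            best = max(pool, key=lambda m: proxy(pairwise_merge(current, m)))
            pool.remove(best)
            current = pairwise_merge(current, best)
        return current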

Load-bearing premise

Task alignment scores correlate strongly enough with the final downstream performance of the merged model after decoder training, across the tested vision tasks and merge methods.
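Given proxy scores and (expensively obtained) downstream metrics over the same hyperparameter grid, the premise reduces to two numbers: the rank correlation between the two lists and the regret of the proxy's pick. A minimal check, with illustrative data only (not the paper's results):

    import numpy as np
    from scipy.stats import spearmanr


    def premise_check(proxy_scores: np.ndarray, downstream: np.ndarray):
        """Spearman rho, plus the gap between the proxy's pick and the true best."""
        rho, _ = spearmanr(proxy_scores, downstream)
        regret = downstream.max() - downstream[np.argmax(proxy_scores)]
        return rho, regret


    # Illustrative numbers only: four candidate merges, proxy vs. downstream.
    rho, regret = premise_check(np.array([0.20, 0.55, 0.90, 0.70]),
                                np.array([61.0, 64.5, 70.2, 68.1]))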

What would settle it

A collection of vision tasks and merge methods in which the merges ranked highest by task alignment, after decoder training, perform materially worse than the merges selected by exhaustive downstream evaluation.

Figures

Figures reproduced from arXiv: 2604.12935 by Björn Michele, César Roberto de Souza, Diane Larlus, Florent Perronnin, Mert Bülent Sarıyıldız, Pau de Jorge, Philippe Weinzaepfel, Yannis Kalantidis.

Figure 1
Figure 1: (Left) Model merging methods produce a large number of candidates which depend on hyperparameters (λ, µ). (Middle) SOTA methods select hyperparameters based on validation performance. This is efficient for frozen decoders like CLIP (standard benchmarks) but becomes infeasible for more complex sets of tasks that require fine-tuning multiple task-specific decoders for each hyperparameter candidate. (Right)… view at source ↗
Figure 2
Figure 2: TAP reduces hyperparameter selection costs by 3 orders of magnitude when merging models for the heterogeneous task setting (see Sec. 4.1). When downstream evaluation is costly due to task-dependent decoder training, hyperparameter search becomes impractical. Our proposed Task Alignment Proxy (TAP) drastically reduces hyperparameter selection costs while maintaining performance. Note original Adamerging… view at source ↗
Figure 3
Figure 3: Normalized TAP vs. performance for different values of the merging coefficient λ when using Task Arithmetic (TA) on the heterogeneous task setting, see Sec. 4.1. The selected value is indicated by a dashed line. The intuition behind TAP is the following: if the features of the fine-tuned encoder θ_t and the merged encoder θ_merged are similar for images of task t, then downstream performance should a… view at source ↗
Figure 4
Figure 4: Top row: Correlation between Task Alignment Proxy (TAP) and performance on the validation sets when merging models for CLIP classification with a ViT-L on the 20-task setting (left), 3D segmentation with different LiDAR sensors (middle), and heterogeneous vision tasks (right). Bottom row: Performance with TAP-selected hyperparameters vs. optimal hyperparameters (with costly downstream evaluation) for the same … view at source ↗
Figure 5
Figure 5: Left: TAP vs. batch size. When increasing the number of samples per task used to compute TAP, the score becomes more stable. However, the selected model remains the same even when computing TAP on very few images. Right: TAP with different distance metrics. TAP is very robust to the distance metric used to compare fine-tuned and merged model features. In Tab. 3 we apply different model merging methods with TAP-… view at source ↗
Figure 6
Figure 6: Top row: Task vector norms for the LiDAR benchmark (left) and for the DUNE benchmark when fine-tuning models from DUNE (middle) or DINOv2 (right). Bottom row: Cosine similarity between task vectors and the average task vector. We observe that for all settings, task vector norms are rather large and unbalanced, strongly biasing the average task vector towards the task with high norm. … view at source ↗
Figure 7
Figure 7: Performance vs. task subsets. When task vector norms are small and balanced (left-most), merging methods perform significantly better than when some task vectors are much larger than others. Task subset experiments. To further explore the impact of task vector norms, we conduct merging experiments on task subsets from the DUNE benchmark. We consider the following 4 subsets (some with balanced norms): {A… view at source ↗
Figure 8
Figure 8: Performance vs. hyperparameter selection cost for the LiDAR setting. TAP reduces hyperparameter selection costs by 2 orders of magnitude when merging models for LiDAR-based segmentation (see Sec. 4.1). Note original Adamerging is not compatible with non-categorical tasks. … view at source ↗
Figure 9
Figure 9: Performance vs. task subsets in the LiDAR setting using nuScenes (NS), Panda64 (PD64) and PandaGT (PDGT). When task vector norms are small and balanced (left-most), merging methods perform significantly better than when some task vectors are much larger than others. … view at source ↗
Figure 10
Figure 10: Left: Correlation between Task Alignment Proxy (TAP) and performance on the validation sets when merging models for the heterogeneous vision tasks with DINOv2 as the base model. Right: Performance with TAP-selected hyperparameters vs. optimal hyperparameters (with costly downstream evaluation) for the same setting. … view at source ↗
Figure 11
Figure 11: Frozen decoders ablation. We show the normalized performance when evaluating merged models with frozen decoders vs. those same models when decoders are trained. We observe that frozen-decoder performance is not a good proxy for trained-decoder performance. … view at source ↗
Figure 12
Figure 12: Task vector norms for the 20-task CLIP benchmark. view at source ↗
Figure 13
Figure 13: Cosine similarity between the task vectors of the CLIP merging setting and the average task vector. Given that all task vector norms are of similar magnitude, the average task vector is similarly aligned with all tasks. view at source ↗
Figure 14
Figure 14: Task vector norms per layer for the CLIP and DUNE benchmarks. view at source ↗
Figure 15
Figure 15: Left: Loss curves observed while training AdaMerging+TAP. We observe similar behavior between the dense tasks (ADE20k, NYUd) and a different behavior in Bedlam. Right: λ values for different tasks observed while training AdaMerging+TAP. Again, similar tasks show a similar pattern, while more specific tasks (Bedlam) follow a different path. view at source ↗
original abstract

Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a task alignment proxy to enable efficient hyperparameter selection during model merging of fine-tuned vision models that share a pretrained backbone but require training of heterogeneous decoders. It claims that this proxy can replace costly full downstream evaluations, speeding up the process by orders of magnitude while preserving merged-model performance, thereby extending model merging beyond CLIP-style classification to broader multi-task vision settings.

Significance. If the proxy's correlation with final performance holds robustly, the work would address a key practical barrier in applying model merging to decoder-heavy tasks such as detection or segmentation, where hyperparameter search is currently prohibitive. This could increase the adoption of merging techniques in realistic computer-vision pipelines.

major comments (2)
  1. [Experiments] The central claim that the task alignment proxy can replace full evaluation while retaining performance rests on an untested correlation between proxy scores and downstream metrics after decoder training. The experiments section should report quantitative rank-preservation statistics (e.g., Spearman ρ or top-k retention rate) across merge methods, tasks, and hyperparameter grids to demonstrate that the proxy identifies the same optimal configurations as the true metric.
  2. [Method] It is unclear whether the task alignment computation remains independent of decoder-specific details when decoders are heterogeneous and trained from scratch. A precise definition (pseudocode or equations) of how alignment is measured without decoder training, together with an ablation on decoder training cost, is needed to confirm the proxy's claimed generality beyond frozen-decoder CLIP classification.
minor comments (1)
  1. [Abstract] The abstract states the speed-up claim without citing the observed factor or the baseline search method; adding a concrete number (e.g., 'from 100 GPU-hours to 2 GPU-hours') would strengthen the practical impact statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript accordingly to strengthen the evidence and clarity.

point-by-point responses
  1. Referee: [Experiments] The central claim that the task alignment proxy can replace full evaluation while retaining performance rests on an untested correlation between proxy scores and downstream metrics after decoder training. The experiments section should report quantitative rank-preservation statistics (e.g., Spearman ρ or top-k retention rate) across merge methods, tasks, and hyperparameter grids to demonstrate that the proxy identifies the same optimal configurations as the true metric.

    Authors: We agree that quantitative rank-preservation statistics would provide stronger support for the central claim. In the revised manuscript we will report Spearman rank correlation coefficients (ρ) and top-k retention rates between proxy scores and true downstream metrics, computed across the merge methods, tasks, and hyperparameter grids already present in our experiments. These additions will directly demonstrate that the proxy recovers the same optimal configurations identified by full evaluation. revision: yes

  2. Referee: [Method] It is unclear whether the task alignment computation remains independent of decoder-specific details when decoders are heterogeneous and trained from scratch. A precise definition (pseudocode or equations) of how alignment is measured without decoder training, together with an ablation on decoder training cost, is needed to confirm the proxy's claimed generality beyond frozen-decoder CLIP classification.

    Authors: Task alignment is computed exclusively on backbone features after merging and before any decoder training, rendering it independent of decoder architecture or training details. We will add explicit equations and pseudocode in the revised Methods section to formalize this computation. We will also include an ablation quantifying the wall-clock savings from avoiding decoder training during hyperparameter search, thereby confirming generality to heterogeneous decoders in multi-task vision settings. revision: yes

Circularity Check

0 steps flagged

No circularity: task alignment proxy defined independently of downstream performance

full rationale

The paper introduces the task alignment proxy as a new, independently motivated quantity for ranking merged models without requiring full decoder training and evaluation. No equations, derivations, or self-citations are presented in the abstract or reader's summary that reduce the proxy to a fitted parameter, a self-defined quantity, or a renamed known result. The central claim is an empirical statement that the proxy correlates sufficiently with final performance to serve as a fast surrogate; this correlation is external to the definition of the proxy itself and is not forced by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the proxy itself is introduced without derivation details visible here.

pith-pipeline@v0.9.0 · 5509 in / 975 out tokens · 38071 ms · 2026-05-10T14:51:27.724868+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. In: Proc. NeurIPS (2022)
  2. [2] Arnold, E., Wynn, J., Vicente, S., Garcia-Hernando, G., Monszpart, Á., Prisacariu, V.A., Turmukhambetov, D., Brachmann, E.: Map-free visual relocalization: Metric pose relative to a single image. In: Proc. ECCV (2022)
  3. [3] Baradel, F., Armando, M., Galaaoui, S., Brégier, R., Weinzaepfel, P., Rogez, G., Lucas, T.: Multi-HMR: Multi-person whole-body human mesh recovery in a single shot. In: Proc. ECCV (2024)
  4. [4] Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., Gall, J.: SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In: Proc. ICCV (2019)
  5. [5] Black, M.J., Patel, P., Tesch, J., Yang, J.: BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: Proc. CVPR (2023)
  6. [6] Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – Mining discriminative components with random forests. In: Proc. ECCV (2014)
  7. [7] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: Proc. CVPR (2020)
  8. [8] Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE (2017)
  9. [9] Choi, J., Kim, D., Lee, C., Hong, S.: Revisiting weight averaging for model merging. arXiv preprint arXiv:2412.12153 (2024)
  10. [10] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proc. CVPR (2014)
  11. [11] Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., Ha, D.: Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718 (2018)
  12. [12] Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proc. AISTATS (2011)
  13. [13] Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: Extending MNIST to handwritten letters. In: Proc. IJCNN (2017)
  14. [14] Davari, M., Belilovsky, E.: Model breadcrumbs: Scaling multi-task model merging with sparse masks. In: Proc. ECCV (2024)
  15. [15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proc. ICLR (2021)
  16. [16] Du, Y., Wang, X., Chen, C., Ye, J., Wang, Y., Li, P., Yan, M., Zhang, J., Huang, F., Sui, Z., et al.: AdaMMS: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization. In: Proc. CVPR (2025)
  17. [17] Dziadzio, S., Udandarao, V., Roth, K., Prabhu, A., Akata, Z., Albanie, S., Bethge, M.: How to merge your multimodal models over time? In: Proc. CVPR (2025)
  18. [18] Fong, W.K., Mohan, R., Hurtado, J.V., Zhou, L., Caesar, H., Beijbom, O., Valada, A.: Panoptic nuScenes: A large-scale benchmark for lidar panoptic segmentation and tracking. RA-L (2021)
  19. [19] Gargiulo, A.A., Crisostomi, D., Bucarelli, M.S., Scardapane, S., Silvestri, F., Rodola, E.: Task singular vectors: Reducing task interference in model merging. In: Proc. CVPR (2025)
  20. [20] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proc. CVPR (2012)
  21. [21] Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., et al.: Challenges in representation learning: A report on three machine learning contests. In: Neural Information Processing (2013)
  22. [22] Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. JSTAEORS (2019)
  23. [23] Ilharco, G., Ribeiro, M.T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., Farhadi, A.: Editing models with task arithmetic. In: Proc. ICLR (2023)
  24. [24] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP. https://github.com/mlfoundations/open_clip (2021), version 0.1
  25. [25] Jin, X., Ren, X., Preotiuc-Pietro, D., Cheng, P.: Dataless knowledge fusion by merging weights of language models. In: Proc. ICLR (2023)
  26. [26] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  27. [27] Krause, J., Deng, J., Stark, M., Li, F.F.: Collecting a large-scale dataset of fine-grained cars. In: Proc. CVPR-W (2013)
  28. [28] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., University of Toronto (2009)
  29. [29] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
  30. [30] Lee, Y.A., Ko, C.Y., Pedapati, T., Chung, I.H., Yeh, M.Y., Chen, P.Y.: STAR: Spectral truncation and rescale for model merging. In: Proc. HLT-NAACL (2025)
  31. [31] Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3D with MASt3R. In: Proc. ECCV (2024)
  32. [32] Matena, M.S., Raffel, C.A.: Merging models with Fisher-weighted averaging. In: Proc. NeurIPS (2022)
  33. [33] Michele, B., Boulch, A., Vu, T.H., Puy, G., Marlet, R., Courty, N.: Train till you drop: Towards stable and robust source-free unsupervised 3d domain adaptation. In: Proc. ECCV (2024)
  34. [34] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
  35. [35] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics & Image Processing (2008)
  36. [36] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision. TMLR (2024)
  37. [37] Ortiz-Jimenez, G., Favero, A., Frossard, P.: Task arithmetic in the tangent space: Improved editing of pre-trained models. In: Proc. NeurIPS (2023)
  38. [38] Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: Proc. CVPR (2012)
  39. [39] Puy, G., Boulch, A., Marlet, R.: Using a waffle iron for automotive point cloud semantic segmentation. In: Proc. ICCV (2023)
  40. [40] Puy, G., Gidaris, S., Boulch, A., Siméoni, O., Sautier, C., Pérez, P., Bursuc, A., Marlet, R.: Three pillars improving vision foundation model distillation for lidar. In: Proc. CVPR (2024)
  41. [41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Proc. ICML (2021)
  42. [42] Ranzinger, M., Heinrich, G., Kautz, J., Molchanov, P.: AM-RADIO: Agglomerative vision foundation model reduce all domains into one. In: Proc. CVPR (2024)
  43. [43] Sarıyıldız, M.B., Weinzaepfel, P., Lucas, T., de Jorge, P., Larlus, D., Kalantidis, Y.: DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers. In: Proc. CVPR (2025)
  44. [44] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Proc. ECCV (2012)
  45. [45] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proc. EMNLP (2013)
  46. [46] Sokar, G., Dziugaite, G.K., Arnab, A., Iscen, A., Castro, P.S., Schmid, C.: Continual learning in vision-language models via aligned model merging. arXiv preprint arXiv:2506.03189 (2025)
  47. [47] Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: The German traffic sign recognition benchmark: A multi-class classification competition. In: Proc. IJCNN (2011)
  48. [48] Veeling, B.S., Linmans, J., Winkens, J., Cohen, T., Welling, M.: Rotation equivariant CNNs for digital pathology. In: MICCAI. Springer (2018)
  49. [49] Wang, K., Dimitriadis, N., Favero, A., Ortiz-Jimenez, G., Fleuret, F., Frossard, P.: LiNeS: Post-training layer scaling prevents forgetting and enhances model merging. In: Proc. ICLR (2025)
  50. [50] Wang, K., Dimitriadis, N., Ortiz-Jiménez, G., Fleuret, F., Frossard, P.: Localizing task information for improved model merging and compression. In: Proc. ICML (2024)
  51. [51] Wortsman, M., Ilharco, G., Gadre, S.Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A.S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., et al.: Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: Proc. ICML (2022)
  52. [52] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
  53. [53] Xiao, J., Ehinger, K.A., Hays, J., Torralba, A., Oliva, A.: SUN database: Exploring a large collection of scene categories. IJCV (2016)
  54. [54] Xiao, P., Shao, Z., Hao, S., Zhang, Z., Chai, X., Jiao, J., Li, Z., Wu, J., Sun, K., Jiang, K., et al.: PandaSet: Advanced sensor suite dataset for autonomous driving. In: ITSC (2021)
  55. [55] Yadav, P., Tam, D., Choshen, L., Raffel, C., Bansal, M.: TIES-Merging: Resolving interference when merging models. In: Proc. NeurIPS (2023)
  56. [56] Yang, E., Wang, Z., Shen, L., Liu, S., Guo, G., Wang, X., Tao, D.: AdaMerging: Adaptive model merging for multi-task learning. In: Proc. ICLR (2024)
  57. [57] Yi, L., Gong, B., Funkhouser, T.: Complete & Label: A domain adaptation approach to semantic segmentation of lidar point clouds. In: Proc. CVPR (2021)
  58. [58] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ADE20k dataset. IJCV (2019)