Task Alignment: A simple and effective proxy for model merging in computer vision
Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3
The pith
A task alignment proxy enables efficient hyperparameter selection for merging vision models with trainable decoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that task alignment scores between pairs of fine-tuned models serve as a reliable and cheap substitute for full downstream evaluation when choosing merge hyperparameters such as coefficients or layer-wise weights. Because decoder training dominates the cost in non-CLIP settings, replacing most evaluations with the proxy reduces search time by orders of magnitude while producing merged models whose accuracy after decoder training stays close to the accuracy obtained by exhaustive search.
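To make the claimed workflow concrete, here is a minimal sketch (not the paper's code) of proxy-guided selection of a single merge coefficient. The helper names merge_backbones, task_alignment, and train_and_eval_decoder are illustrative assumptions, as is the interpolation-based merge rule; the paper's merge methods and exact proxy may differ.

```python
# Minimal sketch of proxy-guided hyperparameter selection (assumed, not the
# paper's code). task_alignment is any cheap score computed without decoder
# training; train_and_eval_decoder is the expensive step paid only once.

def merge_backbones(theta_a: dict, theta_b: dict, alpha: float) -> dict:
    """Interpolate two fine-tuned backbone state dicts (one common merge rule)."""
    return {k: (1.0 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

def select_merge_coefficient(theta_a, theta_b, task_alignment, probe_batch,
                             grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Score every candidate coefficient with the cheap proxy and keep the best."""
    scores = {}
    for alpha in grid:
        merged = merge_backbones(theta_a, theta_b, alpha)
        scores[alpha] = task_alignment(merged, (theta_a, theta_b), probe_batch)
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores

# Only the selected candidate then pays the decoder-training cost:
#   merged = merge_backbones(theta_a, theta_b, best_alpha)
#   metric = train_and_eval_decoder(merged, task_data)  # expensive, done once
```

The savings come from scoring the whole grid with the proxy alone and training decoders only for the configuration it selects.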
What carries the argument
The task alignment proxy: a scalar measure of task compatibility, computed directly from the fine-tuned models, that predicts which merges will succeed after decoder training.
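The summary does not spell out how the alignment score is computed, only that it uses backbone features and no decoder training (see the rebuttal below). One plausible instantiation, given purely as an assumed sketch, is a feature-similarity measure such as linear CKA between the merged backbone and each fine-tuned backbone on a small unlabeled probe batch.

```python
# Assumed instantiation of an alignment score (not the paper's definition):
# linear CKA between features of the merged backbone and of each fine-tuned
# backbone, averaged over tasks.
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear centered kernel alignment between feature matrices of shape [N, D]."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = torch.linalg.norm(x.T @ y) ** 2            # ||X^T Y||_F^2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()

@torch.no_grad()
def task_alignment(merged_backbone, finetuned_backbones, probe_images) -> float:
    """Average similarity of the merged backbone to every fine-tuned backbone."""
    f_merged = merged_backbone(probe_images)         # [N, D] pooled features
    scores = [linear_cka(f_merged, m(probe_images)) for m in finetuned_backbones]
    return sum(scores) / len(scores)
```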
If this is right
- Hyperparameter search for model merging no longer requires training and evaluating a decoder for every candidate set of merge coefficients.
- Model merging becomes feasible for multi-task vision pipelines that rely on custom, trainable decoders rather than frozen classification heads.
- The performance gap between proxy-guided selection and exhaustive downstream selection remains small enough to be acceptable in practice.
- Merging techniques can be applied to broader families of vision models beyond standard CLIP-based image classification setups.
Where Pith is reading between the lines
- The same proxy idea could be tested on sequential merging of more than two models by ranking candidate merges according to pairwise alignment.
- If the correlation holds, the proxy might reduce the barrier to merging models that were fine-tuned with different data augmentations or optimization schedules.
- Practitioners could combine the proxy with existing layer-wise or task-vector merging methods to further cut the remaining decoder training cost.
Load-bearing premise
Task alignment scores correlate strongly enough with the final downstream performance of the merged model after decoder training, across the tested vision tasks and merge methods.
What would settle it
A collection of vision tasks and merge methods in which the merge configurations with the highest task alignment scores yield models that, after decoder training, perform materially worse than the configurations with the lowest alignment scores.
Original abstract
Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a task alignment proxy to enable efficient hyperparameter selection during model merging of fine-tuned vision models that share a pretrained backbone but require training of heterogeneous decoders. It claims that this proxy can replace costly full downstream evaluations, speeding up the process by orders of magnitude while preserving merged-model performance, thereby extending model merging beyond CLIP-style classification to broader multi-task vision settings.
Significance. If the proxy's correlation with final performance holds robustly, the work would address a key practical barrier in applying model merging to decoder-heavy tasks such as detection or segmentation, where hyperparameter search is currently prohibitive. This could increase the adoption of merging techniques in realistic computer-vision pipelines.
major comments (2)
- [Experiments] The central claim that the task alignment proxy can replace full evaluation while retaining performance rests on an untested correlation between proxy scores and downstream metrics after decoder training. The experiments section should report quantitative rank-preservation statistics (e.g., Spearman ρ or top-k retention rate) across merge methods, tasks, and hyperparameter grids to demonstrate that the proxy identifies the same optimal configurations as the true metric.
- [Method] It is unclear whether the task alignment computation remains independent of decoder-specific details when decoders are heterogeneous and trained from scratch. A precise definition (pseudocode or equations) of how alignment is measured without decoder training, together with an ablation on decoder training cost, is needed to confirm the proxy's claimed generality beyond frozen-decoder CLIP classification.
minor comments (1)
- [Abstract] The abstract states the speed-up claim without citing the observed factor or the baseline search method; adding a concrete number (e.g., 'from 100 GPU-hours to 2 GPU-hours') would strengthen the practical impact statement.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript accordingly to strengthen the evidence and clarity.
Point-by-point responses
-
Referee: [Experiments] The central claim that the task alignment proxy can replace full evaluation while retaining performance rests on an untested correlation between proxy scores and downstream metrics after decoder training. The experiments section should report quantitative rank-preservation statistics (e.g., Spearman ρ or top-k retention rate) across merge methods, tasks, and hyperparameter grids to demonstrate that the proxy identifies the same optimal configurations as the true metric.
Authors: We agree that quantitative rank-preservation statistics would provide stronger support for the central claim. In the revised manuscript we will report Spearman rank correlation coefficients (ρ) and top-k retention rates between proxy scores and true downstream metrics, computed across the merge methods, tasks, and hyperparameter grids already present in our experiments. These additions will directly demonstrate that the proxy recovers the same optimal configurations identified by full evaluation (see the sketch after these responses). revision: yes
-
Referee: [Method] It is unclear whether the task alignment computation remains independent of decoder-specific details when decoders are heterogeneous and trained from scratch. A precise definition (pseudocode or equations) of how alignment is measured without decoder training, together with an ablation on decoder training cost, is needed to confirm the proxy's claimed generality beyond frozen-decoder CLIP classification.
Authors: Task alignment is computed exclusively on backbone features after merging and before any decoder training, rendering it independent of decoder architecture or training details. We will add explicit equations and pseudocode in the revised Methods section to formalize this computation. We will also include an ablation quantifying the wall-clock savings from avoiding decoder training during hyperparameter search, thereby confirming generality to heterogeneous decoders in multi-task vision settings. revision: yes
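As a companion to the rank-preservation statistics requested in the first exchange above, here is a minimal, assumed analysis sketch of Spearman ρ and top-k retention between proxy scores and true downstream metrics over a hyperparameter grid; the numbers in the usage example are placeholders, not results from the paper.

```python
# Assumed analysis sketch (placeholder numbers, not results from the paper):
# rank preservation between proxy scores and downstream metrics.
import numpy as np
from scipy.stats import spearmanr

def rank_preservation(proxy_scores, downstream_metrics, k=3):
    """Return Spearman rho and top-k retention between the two rankings."""
    proxy = np.asarray(proxy_scores, dtype=float)
    true = np.asarray(downstream_metrics, dtype=float)
    rho, _ = spearmanr(proxy, true)
    top_proxy = set(np.argsort(-proxy)[:k])   # candidates the proxy prefers
    top_true = set(np.argsort(-true)[:k])     # candidates full evaluation prefers
    return rho, len(top_proxy & top_true) / k

# Five merge candidates scored by the proxy and by full decoder training.
rho, retention = rank_preservation([0.61, 0.72, 0.80, 0.77, 0.55],
                                   [70.2, 73.1, 74.8, 74.5, 68.9], k=2)
```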
Circularity Check
No circularity: task alignment proxy defined independently of downstream performance
full rationale
The paper introduces the task alignment proxy as a new, independently motivated quantity for ranking merged models without requiring full decoder training and evaluation. No equations, derivations, or self-citations are presented in the abstract or reader's summary that reduce the proxy to a fitted parameter, a self-defined quantity, or a renamed known result. The central claim is an empirical statement that the proxy correlates sufficiently with final performance to serve as a fast surrogate; this correlation is external to the definition of the proxy itself and is not forced by construction. The claim is therefore falsifiable against external benchmarks rather than true by definition.