Task Alignment: A simple and effective proxy for model merging in computer vision
Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3
The pith
A task alignment proxy enables efficient hyperparameter selection for merging vision models with trainable decoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that task alignment scores between pairs of fine-tuned models serve as a reliable and cheap substitute for full downstream evaluation when choosing merge hyperparameters such as coefficients or layer-wise weights. Because decoder training dominates the cost in non-CLIP settings, replacing most evaluations with the proxy reduces search time by orders of magnitude while producing merged models whose accuracy after decoder training stays close to the accuracy obtained by exhaustive search.
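To make the claimed workflow concrete, here is a minimal sketch (not the paper's code) of proxy-guided selection of a single merge coefficient. The helper names merge_backbones, task_alignment, and train_and_eval_decoder are illustrative assumptions, as is the interpolation-based merge rule; the paper's merge methods and exact proxy may differ.

```python
# Minimal sketch of proxy-guided hyperparameter selection (assumed, not the
# paper's code). task_alignment is any cheap score computed without decoder
# training; train_and_eval_decoder is the expensive step paid only once.

def merge_backbones(theta_a: dict, theta_b: dict, alpha: float) -> dict:
    """Interpolate two fine-tuned backbone state dicts (one common merge rule)."""
    return {k: (1.0 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

def select_merge_coefficient(theta_a, theta_b, task_alignment, probe_batch,
                             grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Score every candidate coefficient with the cheap proxy and keep the best."""
    scores = {}
    for alpha in grid:
        merged = merge_backbones(theta_a, theta_b, alpha)
        scores[alpha] = task_alignment(merged, (theta_a, theta_b), probe_batch)
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores

# Only the selected candidate then pays the decoder-training cost:
#   merged = merge_backbones(theta_a, theta_b, best_alpha)
#   metric = train_and_eval_decoder(merged, task_data)  # expensive, done once
```

The savings come from scoring the whole grid with the proxy alone and training decoders only for the configuration it selects.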
What carries the argument
The task alignment proxy: a scalar measure of task compatibility, computed directly from the fine-tuned models, that predicts which merges will succeed after decoder training.
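The summary does not spell out how the alignment score is computed, only that it uses backbone features and no decoder training (see the rebuttal below). One plausible instantiation, given purely as an assumed sketch, is a feature-similarity measure such as linear CKA between the merged backbone and each fine-tuned backbone on a small unlabeled probe batch.

```python
# Assumed instantiation of an alignment score (not the paper's definition):
# linear CKA between features of the merged backbone and of each fine-tuned
# backbone, averaged over tasks.
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear centered kernel alignment between feature matrices of shape [N, D]."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = torch.linalg.norm(x.T @ y) ** 2            # ||X^T Y||_F^2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()

@torch.no_grad()
def task_alignment(merged_backbone, finetuned_backbones, probe_images) -> float:
    """Average similarity of the merged backbone to every fine-tuned backbone."""
    f_merged = merged_backbone(probe_images)         # [N, D] pooled features
    scores = [linear_cka(f_merged, m(probe_images)) for m in finetuned_backbones]
    return sum(scores) / len(scores)
```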
If this is right
- Hyperparameter search for model merging no longer requires training and evaluating a decoder for every candidate set of merge coefficients.
- Model merging becomes feasible for multi-task vision pipelines that rely on custom, trainable decoders rather than frozen classification heads.
- The performance gap between proxy-guided selection and exhaustive downstream selection remains small enough to be acceptable in practice.
- Merging techniques can be applied to broader families of vision models beyond standard CLIP-based image classification setups.
Where Pith is reading between the lines
- The same proxy idea could be tested on sequential merging of more than two models by ranking candidate merges according to pairwise alignment.
- If the correlation holds, the proxy might reduce the barrier to merging models that were fine-tuned with different data augmentations or optimization schedules.
- Practitioners could combine the proxy with existing layer-wise or task-vector merging methods to further cut the remaining decoder training cost.
Load-bearing premise
Task alignment scores correlate strongly enough with the final downstream performance of the merged model after decoder training, across the tested vision tasks and merge methods.
What would settle it
A collection of vision tasks and merge methods in which the merge configurations with the highest task alignment scores yield models that, after decoder training, perform materially worse than the configurations with the lowest alignment scores.
Original abstract
Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a task alignment proxy to enable efficient hyperparameter selection during model merging of fine-tuned vision models that share a pretrained backbone but require training of heterogeneous decoders. It claims that this proxy can replace costly full downstream evaluations, speeding up the process by orders of magnitude while preserving merged-model performance, thereby extending model merging beyond CLIP-style classification to broader multi-task vision settings.
Significance. If the proxy's correlation with final performance holds robustly, the work would address a key practical barrier in applying model merging to decoder-heavy tasks such as detection or segmentation, where hyperparameter search is currently prohibitive. This could increase the adoption of merging techniques in realistic computer-vision pipelines.
major comments (2)
- [Experiments] The central claim that the task alignment proxy can replace full evaluation while retaining performance rests on an untested correlation between proxy scores and downstream metrics after decoder training. The experiments section should report quantitative rank-preservation statistics (e.g., Spearman ρ or top-k retention rate) across merge methods, tasks, and hyperparameter grids to demonstrate that the proxy identifies the same optimal configurations as the true metric.
- [Method] It is unclear whether the task alignment computation remains independent of decoder-specific details when decoders are heterogeneous and trained from scratch. A precise definition (pseudocode or equations) of how alignment is measured without decoder training, together with an ablation on decoder training cost, is needed to confirm the proxy's claimed generality beyond frozen-decoder CLIP classification.
minor comments (1)
- [Abstract] The abstract states the speed-up claim without citing the observed factor or the baseline search method; adding a concrete number (e.g., 'from 100 GPU-hours to 2 GPU-hours') would strengthen the practical impact statement.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript accordingly to strengthen the evidence and clarity.
Point-by-point responses
-
Referee: [Experiments] The central claim that the task alignment proxy can replace full evaluation while retaining performance rests on an untested correlation between proxy scores and downstream metrics after decoder training. The experiments section should report quantitative rank-preservation statistics (e.g., Spearman ρ or top-k retention rate) across merge methods, tasks, and hyperparameter grids to demonstrate that the proxy identifies the same optimal configurations as the true metric.
Authors: We agree that quantitative rank-preservation statistics would provide stronger support for the central claim. In the revised manuscript we will report Spearman rank correlation coefficients (ρ) and top-k retention rates between proxy scores and true downstream metrics, computed across the merge methods, tasks, and hyperparameter grids already present in our experiments. These additions will directly demonstrate that the proxy recovers the same optimal configurations identified by full evaluation (see the sketch after these responses). revision: yes
-
Referee: [Method] It is unclear whether the task alignment computation remains independent of decoder-specific details when decoders are heterogeneous and trained from scratch. A precise definition (pseudocode or equations) of how alignment is measured without decoder training, together with an ablation on decoder training cost, is needed to confirm the proxy's claimed generality beyond frozen-decoder CLIP classification.
Authors: Task alignment is computed exclusively on backbone features after merging and before any decoder training, rendering it independent of decoder architecture or training details. We will add explicit equations and pseudocode in the revised Methods section to formalize this computation. We will also include an ablation quantifying the wall-clock savings from avoiding decoder training during hyperparameter search, thereby confirming generality to heterogeneous decoders in multi-task vision settings. revision: yes
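As a companion to the rank-preservation statistics requested in the first exchange above, here is a minimal, assumed analysis sketch of Spearman ρ and top-k retention between proxy scores and true downstream metrics over a hyperparameter grid; the numbers in the usage example are placeholders, not results from the paper.

```python
# Assumed analysis sketch (placeholder numbers, not results from the paper):
# rank preservation between proxy scores and downstream metrics.
import numpy as np
from scipy.stats import spearmanr

def rank_preservation(proxy_scores, downstream_metrics, k=3):
    """Return Spearman rho and top-k retention between the two rankings."""
    proxy = np.asarray(proxy_scores, dtype=float)
    true = np.asarray(downstream_metrics, dtype=float)
    rho, _ = spearmanr(proxy, true)
    top_proxy = set(np.argsort(-proxy)[:k])   # candidates the proxy prefers
    top_true = set(np.argsort(-true)[:k])     # candidates full evaluation prefers
    return rho, len(top_proxy & top_true) / k

# Five merge candidates scored by the proxy and by full decoder training.
rho, retention = rank_preservation([0.61, 0.72, 0.80, 0.77, 0.55],
                                   [70.2, 73.1, 74.8, 74.5, 68.9], k=2)
```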
Circularity Check
No circularity: task alignment proxy defined independently of downstream performance
full rationale
The paper introduces the task alignment proxy as a new, independently motivated quantity for ranking merged models without requiring full decoder training and evaluation. No equations, derivations, or self-citations are presented in the abstract or reader's summary that reduce the proxy to a fitted parameter, a self-defined quantity, or a renamed known result. The central claim is an empirical statement that the proxy correlates sufficiently with final performance to serve as a fast surrogate; this correlation is external to the definition of the proxy itself and is not forced by construction. The claim is therefore falsifiable against external benchmarks rather than true by definition.