Bilinear Coordinate Alignment for Training-Free Task-Vector Transfer

Jinwook Jung; Jungyong Son; Minhee Park; Sungyong Baik

arxiv: 2605.28444 · v1 · pith:HJ6VGYMBnew · submitted 2026-05-27 · 💻 cs.LG

Bilinear Coordinate Alignment for Training-Free Task-Vector Transfer

Jungyong Son , Jinwook Jung , Minhee Park , Sungyong Baik This is my paper

Pith reviewed 2026-06-29 14:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords task vectorsmodel transferbilinear interactionsProcrustes alignmenttraining-free adaptationactivation gradient correspondencecomputer visionnatural language processing

0 comments

The pith

Task vectors form from bilinear interactions between activations and gradients and transfer across models by aligning those spaces without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that task vectors are not simple parameter offsets but the result of accumulated bilinear products between input activations and output gradients. This view turns the transfer problem into one of aligning two separate coordinate systems, one for activations and one for gradients. BiCo solves the alignment by solving two orthogonal Procrustes problems on a small calibration set in a single forward-backward pass. If the approach works, expertise acquired on one model can move to a new model of different width, depth, or pre-training history at negligible cost. The method is shown to close much of the gap to direct fine-tuning on vision and language benchmarks.

Core claim

Task vectors can be derived as accumulated bilinear interactions between input-side activations and output-side gradients. Formulating task-vector transfer as a dual-space alignment problem, BiCo estimates orthogonal Procrustes mappings in both the activation space and the gradient space from a single forward-backward pass on a small calibration set, without any parameter update, and thereby transfers task expertise across models that differ in width, depth, and pre-training configuration.

What carries the argument

Bilinear Coordinate alignment (BiCo), which computes separate orthogonal Procrustes mappings for activation and gradient spaces to align task vectors derived from bilinear interactions.

If this is right

Task expertise moves between models of different widths and depths without retraining.
Transfer succeeds on both computer-vision and natural-language benchmarks.
Only a small calibration set and one forward-backward pass are needed.
Performance improves over methods that align only activations or only gradients.
No gradient steps or parameter changes occur during the transfer step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-alignment step could be inserted into pipelines that already move models between different tokenizers or embedding dimensions.
If the bilinear view holds, analogous alignments might reduce the cost of merging multiple task vectors into one model.
The calibration-set size needed for stable Procrustes estimates could be tested as a function of model scale to find practical limits.
The method might extend directly to continual-learning settings where a base model is periodically replaced.

Load-bearing premise

The bilinear interactions captured by one forward-backward pass on a small calibration set supply enough correspondence to move task expertise between models that differ in width, depth, and pre-training history.

What would settle it

Applying the two estimated Procrustes mappings to a held-out model pair differing in depth and width and finding that accuracy stays at the level of prior activation-only or gradient-only baselines would falsify the claim that dual bilinear alignment is required and sufficient.

Figures

Figures reproduced from arXiv: 2605.28444 by Jinwook Jung, Jungyong Son, Minhee Park, Sungyong Baik.

**Figure 1.** Figure 1: Representational similarity before and after input-side alignment. We compare CKA and cosine similarity between pre-trained activations of same-width ViT-B/16 models with different pre-training distributions (A: Datacomp-XL → B: LAION-2B). (Left) before alignment. (Right) after Procrustes alignment. Results are averaged across 8 vision tasks and sub-layers within each residual block, using N = 100 calibrat… view at source ↗

**Figure 2.** Figure 2: Layer-wise output similarity to the fine-tuned target. For ViT-B/16 (A: Datacomp-XL → B: LAION-2B), we compare layer-wise output activations of each strategy with those of θB,ft. BiCo is closest, especially in deeper blocks. Effect of output-side alignment. We analyze whether each transfer strategy induces layerwise behavior similar to that of the oracle target model θB,ft, obtained by directly fine-tunin… view at source ↗

read the original abstract

Fine-tuning large-scale pre-trained models is a recent prevalent paradigm for adapting general representations to specialized tasks. However, when a new version of a pre-trained model becomes available, expertise acquired through fine-tuning cannot be directly reused because it is tied to the parameterization of the original model, requiring another costly fine-tuning. To address this inefficiency, recent work uses task vectors, defined as the parameter difference between a fine-tuned model and its base model, to transfer expertise across models. While existing methods bridge disparate models by matching activations or gradients, a significant performance gap remains relative to direct fine-tuning, suggesting that these partial correspondences are insufficient. In this work, instead of viewing a task vector merely as a parameter offset, we revisit the formation of task vectors and show that they can be derived as accumulated bilinear interactions between input-side activations and output-side gradients. Motivated by this observation, we formulate task-vector transfer as a dual-space alignment problem and propose BiCo, a training-free framework for transferring task vectors through Bilinear Coordinate alignment. BiCo estimates orthogonal Procrustes mappings in both spaces using a single forward-backward pass on a small calibration set, without any parameter update. Across extensive computer vision and natural language processing benchmarks, BiCo consistently outperforms existing transfer methods across models that differ in width, depth, and pre-training configuration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The bilinear task vector framing and dual alignment is a fresh angle on transfer, but the single-pass calibration assumption needs close checking for models with big differences.

read the letter

The main takeaway is that this paper proposes BiCo, a training-free method to transfer task vectors by aligning bilinear coordinates in activation and gradient spaces using dual Procrustes mappings estimated from one forward-backward pass.

The new part is treating task vectors as accumulated bilinear interactions rather than just parameter offsets. This leads to formulating the transfer as a dual-space alignment problem. They estimate orthogonal mappings without updating parameters, which is efficient. The abstract reports that it outperforms existing methods on computer vision and natural language processing benchmarks for models that vary in size and training.

This approach does address the inefficiency of re-fine-tuning when base models change. The single-pass calibration is a practical plus if it works.

The potential issue is whether that one pass on a small calibration set captures sufficient correspondence for models that differ in width, depth, and pre-training. The paper notes that prior partial correspondences left a performance gap, so the bilinear combination needs to close it convincingly. If the bilinear derivation is not tight enough, the alignment could fall short in practice. It would help to see detailed ablations on how the method behaves with larger architectural differences or smaller calibration sets.

Overall, this is for researchers focused on model adaptation and reuse in large-scale ML. Someone looking for training-free transfer ideas would find the framing useful.

I would send it for peer review. The idea is novel enough and the problem important enough to warrant referee time, though the empirical robustness needs verification.

Referee Report

2 major / 2 minor

Summary. The paper claims that task vectors arise as accumulated bilinear interactions between input-side activations and output-side gradients rather than mere parameter offsets. It formulates transfer as a dual-space alignment problem and introduces BiCo, a training-free method that estimates orthogonal Procrustes mappings in activation and gradient spaces from a single forward-backward pass on a small calibration set, without parameter updates. Experiments on CV and NLP benchmarks show consistent gains over prior activation- or gradient-matching baselines when transferring across models that differ in width, depth, and pre-training.

Significance. If the bilinear derivation is exact and the single-pass dual alignment suffices to close the gap left by partial correspondences, the result would be significant: it offers a parameter-free route to reuse fine-tuning expertise across evolving base models, reducing repeated fine-tuning costs in both vision and language domains.

major comments (2)

[§3 (derivation of task vectors)] The central derivation that task vectors equal accumulated bilinear interactions (activations × gradients) is load-bearing for the dual Procrustes formulation; without an explicit expansion or proof that this equivalence holds exactly (rather than approximately), it is unclear why aligning both spaces with one calibration pass should succeed where single-space methods failed.
[§4 (BiCo construction) and experimental tables] The claim that a single forward-backward pass on a small calibration set yields sufficient correspondence for models differing in width, depth, and pre-training rests on the weakest assumption; prior partial correspondences left a performance gap, yet no theoretical bound or ablation demonstrates why the bilinear combination closes it.

minor comments (2)

[§4] Notation for the two Procrustes mappings (activation-space and gradient-space) should be introduced with explicit symbols and distinguished from the task vector itself to avoid reader confusion.
[Experiments] The abstract states 'extensive' benchmarks but does not reference specific effect sizes or failure cases; adding a table of per-task deltas relative to direct fine-tuning would strengthen the empirical section.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on our manuscript. We provide point-by-point responses to the major comments below.

read point-by-point responses

Referee: [§3 (derivation of task vectors)] The central derivation that task vectors equal accumulated bilinear interactions (activations × gradients) is load-bearing for the dual Procrustes formulation; without an explicit expansion or proof that this equivalence holds exactly (rather than approximately), it is unclear why aligning both spaces with one calibration pass should succeed where single-space methods failed.

Authors: Section 3 derives the task vector exactly as the sum of bilinear terms (activations outer-product gradients) by expanding the gradient of the fine-tuning loss with respect to the parameters of a linear layer. The equivalence follows directly from the chain rule under standard back-propagation assumptions and holds exactly for the definition of task vectors used throughout the literature; it is not an approximation. We will insert additional intermediate algebraic steps in the revised version to make the exactness explicit. revision: partial
Referee: [§4 (BiCo construction) and experimental tables] The claim that a single forward-backward pass on a small calibration set yields sufficient correspondence for models differing in width, depth, and pre-training rests on the weakest assumption; prior partial correspondences left a performance gap, yet no theoretical bound or ablation demonstrates why the bilinear combination closes it.

Authors: We do not supply a theoretical bound on the gap closure. The manuscript instead reports extensive empirical evidence (Section 5 and associated ablations) that the dual-space orthogonal Procrustes alignment, obtained from one forward-backward pass, consistently narrows the gap left by single-space baselines across width, depth, and pre-training mismatches. The bilinear construction is motivated by the exact derivation in §3, which supplies the rationale for why both spaces must be aligned; the experiments then validate that this suffices in practice. revision: no

standing simulated objections not resolved

A theoretical bound guaranteeing that bilinear dual-space alignment closes the performance gap for arbitrary model differences.

Circularity Check

0 steps flagged

No circularity: task-vector bilinear derivation presented as independent mathematical observation; alignment method uses external calibration data without self-referential reduction

full rationale

The paper's core derivation chain begins with the standard definition of a task vector as the parameter difference between fine-tuned and base models. It then claims to derive this as accumulated bilinear interactions between activations and gradients, which is presented as a first-principles observation rather than a redefinition or fit. The subsequent dual-space Procrustes alignment is estimated from a single forward-backward pass on an external calibration set and evaluated on benchmarks across differing architectures; no step reduces the claimed transfer performance to a fitted parameter renamed as prediction, nor relies on load-bearing self-citations or ansatzes imported from prior author work. The method is self-contained against external benchmarks with no evidence of the result being equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit fitting constants or new postulated objects are described.

pith-pipeline@v0.9.1-grok · 5773 in / 1078 out tokens · 37317 ms · 2026-06-29T14:19:10.332892+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 6 canonical work pages · 6 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa

Samuel K. Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. InICLR, 2023

2023
[3]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

Revisiting model stitching to compare neural representations

Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. InNeurIPS, 2021

2021
[5]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. InEMNLP, 2015

2015
[6]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

2020
[7]

Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 2017

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 2017

2017
[8]

Whoever started the interference should end it: Guiding data-free model merging via task vectors

Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. InICML, 2025

2025
[9]

Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. InCVPR, 2023

2023
[10]

Mohamed, and A

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. InCVPR, 2014

2014
[11]

Matszangosz, Gergely Papp, and Dániel Varga

Adrián Csiszárik, Péter K˝orösi-Szabó, Ákos K. Matszangosz, Gergely Papp, and Dániel Varga. Similarity and matching of neural network representations. InNeurIPS, 2021

2021
[12]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021

2021
[13]

Datacomp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...

2023
[14]

Task singular vectors: Reducing task interference in model merging

Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodolà. Task singular vectors: Reducing task interference in model merging. InCVPR, 2025

2025
[15]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019. 10

2019
[16]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InICLR, 2023

2023
[17]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[18]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InICML, 2019

2019
[19]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InICCV Workshop, 2013

2013
[20]

Lecun, L

Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 1998

1998
[21]

Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y . C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, and Cheng-Hao Kuo. Revisiting model stitching in the foundation model era. InCVPR, 2026

2026
[22]

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y . Ng. Reading digits in natural images with unsupervised feature learning. InNIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011

2011
[23]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 2020

2020
[24]

Update your trans- former to the latest release: Re-basin of task vectors

Filippo Rinaldi, Giacomo Capitani, Lorenzo Bonicelli, Donato Crisostomi, Federico Bolelli, Elisa Ficarra, Emanuele Rodolà, Simone Calderara, and Angelo Porrello. Update your trans- former to the latest release: Re-basin of task vectors. InICML, 2025

2025
[25]

Gradient-sign masking for task vector transport across pre- trained models

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Fengyuan Liu, Marco Ciccone, Angelo Porrello, and Simone Calderara. Gradient-sign masking for task vector transport across pre- trained models. InICLR, 2026

2026
[26]

Transporting Task Vectors across Different Architectures without Training

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, and Simone Calder- ara. Transporting task vectors across different architectures without training.arXiv preprint arXiv:2602.12952, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Schönemann

Peter H. Schönemann. A generalized solution of the orthogonal procrustes problem.Psychome- trika, 1966

1966
[28]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models....

2022
[29]

Natural Language Understanding with the Quora Question Pairs Dataset

Lakshay Sharma, Laura Graesser, Nikita Nangia, and Utku Evci. Natural language understand- ing with the quora question pairs dataset.arXiv preprint arXiv:1907.01041, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[30]

The german traffic sign recognition benchmark: A multi-class classification competition

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: A multi-class classification competition. InIJCNN, 2011

2011
[31]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019. 11

2019
[33]

Localizing task information for improved model merging and compression

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, and Pascal Frossard. Localizing task information for improved model merging and compression. InICML, 2024

2024
[34]

Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M

Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners. In ICLR, 2022

2022
[35]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. InNAACL-HLT, 2018

2018
[36]

Robust fine-tuning of zero-shot models

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo-Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. InCVPR, 2022

2022
[37]

Ehinger, Aude Oliva, and Antonio Torralba

Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. InCVPR, 2010

2010
[38]

Ties-merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. InNeurIPS, 2023

2023
[39]

Language models are super mario: Absorbing abilities from homologous models as a free lunch

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. InICML, 2024. 12 Table of contents We provide the following items in this Appendix: • (A) Pseudocode of BiCo • (B) Activation consistency along fine-tuning trajectories • (C) Cross-width representational simi...

2024

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa

Samuel K. Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. InICLR, 2023

2023

[3] [3]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

Revisiting model stitching to compare neural representations

Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. InNeurIPS, 2021

2021

[5] [5]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. InEMNLP, 2015

2015

[6] [6]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

2020

[7] [7]

Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 2017

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 2017

2017

[8] [8]

Whoever started the interference should end it: Guiding data-free model merging via task vectors

Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. InICML, 2025

2025

[9] [9]

Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. InCVPR, 2023

2023

[10] [10]

Mohamed, and A

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. InCVPR, 2014

2014

[11] [11]

Matszangosz, Gergely Papp, and Dániel Varga

Adrián Csiszárik, Péter K˝orösi-Szabó, Ákos K. Matszangosz, Gergely Papp, and Dániel Varga. Similarity and matching of neural network representations. InNeurIPS, 2021

2021

[12] [12]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021

2021

[13] [13]

Datacomp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...

2023

[14] [14]

Task singular vectors: Reducing task interference in model merging

Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodolà. Task singular vectors: Reducing task interference in model merging. InCVPR, 2025

2025

[15] [15]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019. 10

2019

[16] [16]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InICLR, 2023

2023

[17] [17]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[18] [18]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InICML, 2019

2019

[19] [19]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InICCV Workshop, 2013

2013

[20] [20]

Lecun, L

Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 1998

1998

[21] [21]

Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y . C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, and Cheng-Hao Kuo. Revisiting model stitching in the foundation model era. InCVPR, 2026

2026

[22] [22]

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y . Ng. Reading digits in natural images with unsupervised feature learning. InNIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011

2011

[23] [23]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 2020

2020

[24] [24]

Update your trans- former to the latest release: Re-basin of task vectors

Filippo Rinaldi, Giacomo Capitani, Lorenzo Bonicelli, Donato Crisostomi, Federico Bolelli, Elisa Ficarra, Emanuele Rodolà, Simone Calderara, and Angelo Porrello. Update your trans- former to the latest release: Re-basin of task vectors. InICML, 2025

2025

[25] [25]

Gradient-sign masking for task vector transport across pre- trained models

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Fengyuan Liu, Marco Ciccone, Angelo Porrello, and Simone Calderara. Gradient-sign masking for task vector transport across pre- trained models. InICLR, 2026

2026

[26] [26]

Transporting Task Vectors across Different Architectures without Training

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, and Simone Calder- ara. Transporting task vectors across different architectures without training.arXiv preprint arXiv:2602.12952, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [27]

Schönemann

Peter H. Schönemann. A generalized solution of the orthogonal procrustes problem.Psychome- trika, 1966

1966

[28] [28]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models....

2022

[29] [29]

Natural Language Understanding with the Quora Question Pairs Dataset

Lakshay Sharma, Laura Graesser, Nikita Nangia, and Utku Evci. Natural language understand- ing with the quora question pairs dataset.arXiv preprint arXiv:1907.01041, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[30] [30]

The german traffic sign recognition benchmark: A multi-class classification competition

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: A multi-class classification competition. InIJCNN, 2011

2011

[31] [31]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019. 11

2019

[33] [33]

Localizing task information for improved model merging and compression

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, and Pascal Frossard. Localizing task information for improved model merging and compression. InICML, 2024

2024

[34] [34]

Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M

Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners. In ICLR, 2022

2022

[35] [35]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. InNAACL-HLT, 2018

2018

[36] [36]

Robust fine-tuning of zero-shot models

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo-Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. InCVPR, 2022

2022

[37] [37]

Ehinger, Aude Oliva, and Antonio Torralba

Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. InCVPR, 2010

2010

[38] [38]

Ties-merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. InNeurIPS, 2023

2023

[39] [39]

Language models are super mario: Absorbing abilities from homologous models as a free lunch

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. InICML, 2024. 12 Table of contents We provide the following items in this Appendix: • (A) Pseudocode of BiCo • (B) Activation consistency along fine-tuning trajectories • (C) Cross-width representational simi...

2024