pith. machine review for the scientific record.

arxiv: 2605.13030 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


FeatCal: Feature Calibration for Post-Merging Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords model merging · feature drift · calibration · task arithmetic · CLIP · GLUE · post-merging

The pith

Feature drift in merged models decomposes into upstream propagation and local mismatch that can be corrected layer by layer using closed-form calibration on a small set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that performance drops after merging task experts arise from feature drift between the merged model and each expert on the same inputs. This drift is tracked through layers in forward order as the sum of changes propagated from earlier layers and fresh mismatches at the current layer, directly tying it to degraded outputs. The view leads to FeatCal, which solves for small weight adjustments at each layer using a closed-form expression on a tiny calibration set, cutting drift while staying near the original merged weights and keeping the speed and storage benefits of merging. A sympathetic reader cares because the method recovers most expert-level accuracy on vision and language benchmarks without gradients, extra parameters, or joint retraining.

Core claim

Feature drift between merged and expert models can be decomposed into upstream propagation and local mismatch; tracking this drift through layers in forward order links it to output degradation and motivates an efficient closed-form layer-wise calibration that reduces drift while remaining close to the merged weights.

What carries the argument

The decomposition of feature drift into upstream propagation and local mismatch, which enables derivation of layer-wise closed-form weight calibration updates.
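One way to see why such a decomposition exists at all: it is an exact add-and-subtract identity. The sketch below uses assumed notation, not the paper's own symbols. Writing $h_\ell = f_\ell(h_{\ell-1})$ for the features after layer $\ell$, with superscripts $m$ for the merged model and $e$ for the expert:

```latex
% Exact add-and-subtract identity behind the drift decomposition
% (notation assumed: h_l = f_l(h_{l-1}); m = merged model, e = expert)
h_\ell^{m} - h_\ell^{e}
  = \underbrace{f_\ell^{m}\!\big(h_{\ell-1}^{m}\big) - f_\ell^{m}\!\big(h_{\ell-1}^{e}\big)}_{\text{upstream propagation}}
  + \underbrace{f_\ell^{m}\!\big(h_{\ell-1}^{e}\big) - f_\ell^{e}\!\big(h_{\ell-1}^{e}\big)}_{\text{local mismatch}}
```

The first term is driven entirely by drift already present at layer $\ell-1$; the second vanishes whenever the merged layer matches the expert layer, which is the part a layer-wise calibration can attack directly.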

If this is right

  • FeatCal reaches 85.5% accuracy on CLIP-ViT-B/32 Task Arithmetic versus 77.0% and 78.8% for Surgery and ProbSurgery.
  • On FLAN-T5-base GLUE it reaches 85.2% versus 83.7% and 82.2%.
  • Eight examples per task yield 82.9% on CLIP-ViT-B/32 while 256 examples finish in 53 seconds, roughly four times faster than the baselines.
  • No gradient descent, iterative optimization, or added modules are required.
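The paper's exact update rule is not quoted in this review, but a "closed-form, no gradient descent" layer calibration is naturally a ridge-style least-squares solve (the paper's reference list includes Hoerl–Kennard ridge regression and Tikhonov regularization). A minimal sketch under that assumption, with hypothetical names; this is an illustration of the form of such an update, not the paper's algorithm:

```python
import numpy as np

def calibrate_layer(W_merged, X_merged, F_expert, lam=1.0):
    """Closed-form ridge-style update for one linear layer (illustrative).

    Solves   min_W ||X_merged @ W.T - F_expert||^2 + lam * ||W - W_merged||^2,
    whose stationary point is  W = (F^T X + lam * W_m) (X^T X + lam * I)^{-1}.
    The penalty keeps the calibrated weights near the merged weights.
    """
    d = X_merged.shape[1]
    A = X_merged.T @ X_merged + lam * np.eye(d)        # (d, d) Gram matrix
    B = F_expert.T @ X_merged + lam * W_merged         # (d_out, d) target side
    return np.linalg.solve(A, B.T).T                   # calibrated weights
```

With a large `lam` the solution stays at the merged weights; with a small `lam` it fits the expert features on the calibration set as closely as the data allow.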

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same forward-order calibration could be applied to merging methods other than task arithmetic.
  • Feature-drift tracking may allow selective recalibration of only the layers where drift accumulates most.
  • The closed-form solution could support incremental merging by updating only new layers when additional tasks are added.

Load-bearing premise

The decomposition of feature drift into upstream propagation and local mismatch captures the dominant cause of output degradation, and the layer-wise closed-form calibration on a small set generalizes without harming unrelated capabilities.

What would settle it

Applying the FeatCal updates fails to reduce measured feature drift or raise accuracy above the uncalibrated merged model and the Surgery baselines on a held-out calibration set from the same tasks.
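The drift measurements such a test would rest on are the two diagnostics the figures report: mean L2 feature drift and mean cosine similarity between merged and expert features. A minimal sketch (function name hypothetical):

```python
import numpy as np

def feature_drift(F_merged, F_expert):
    """Mean L2 drift and mean cosine similarity between two feature
    matrices (rows = examples, columns = feature dimensions)."""
    diff = F_merged - F_expert
    mean_l2 = np.linalg.norm(diff, axis=1).mean()
    cos = np.sum(F_merged * F_expert, axis=1) / (
        np.linalg.norm(F_merged, axis=1) * np.linalg.norm(F_expert, axis=1))
    return mean_l2, cos.mean()
```

Comparing these numbers before and after applying the updates, on a held-out split, is exactly the settling experiment described above.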

Figures

Figures reproduced from arXiv: 2605.13030 by Hongxia Yang, Jianmin Wu, Pengkai Wang, Shuo Cai, Sirui Huang, Su Lu, Wenjun Wang, Yanggan Gu, Yuanyi Wang, Zihao Wang.

Figure 1
Figure 1: Feature drift after Task Arithmetic (TA) merging and FEATCAL calibration in CLIP-ViT-B/32. Panels (a,b) use Stanford Cars: FEATCAL moves features toward expert features, raises their mean cosine similarity from 0.60 to 0.84, and reduces mean L2 feature drift, with 46% less final-layer drift. Panel (c) reports per-task accuracy in the 8-task setting. App. B gives full 8-task feature views. view at source ↗
Figure 2
Figure 2: Feature-calibration diagnostics for TA on CLIP-ViT-B/32. (a) Task-wise final-feature … view at source ↗
Figure 3
Figure 3: Sample efficiency and calibration cost. view at source ↗
Figure 4
Figure 4: Calibration examples under corruptions. Robustness to corrupted calibration data. We corrupt only the images used for post-merging calibration and evaluate on clean test sets from 8-task TA, isolating calibration data quality rather than test-time corruption robustness. Protocol details are in App. L. FEATCAL remains strongest in every reported setting, with a 78.0% average over clean, Gaussian noise, mot… view at source ↗
Figure 5
Figure 5: CLIP-ViT-B/32 TA coefficient sweeps. Ablation study. view at source ↗
Figure 6
Figure 6: Extends the Stanford Cars visualization in panel (a) of … view at source ↗
Figure 7
Figure 7: Feature-calibration diagnostics for TA w/ … view at source ↗
read the original abstract

Model merging combines task experts into one model and avoids joint training, retraining, or deploying many expert models, but the merged model often still underperforms task experts. We study this performance gap through feature drift, the difference between features produced by the merged model and by the expert on the same input. Our theory decomposes this drift into upstream propagation and local mismatch, tracks how it propagates and combines through later layers in forward order, and links final feature drift to output drift. This view motivates FeatCal, which uses a small calibration set to calibrate the merged model weights layer by layer in forward order, reducing feature drift while staying close to merged weights and preserving the benefits of model merging. FeatCal uses an efficient closed-form solution to update model weights, with no gradient descent, iterative optimization, or extra modules. On the main CLIP and GLUE benchmarks, FeatCal beats Surgery and ProbSurgery, the closest post-merging calibration baselines: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic (TA) and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE. On CLIP-ViT-B/32, 8 examples per task reach 82.9%, and 256 examples per task take 53 seconds, about 4x faster than both baselines, showing better sample efficiency and lower calibration cost.
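For readers unfamiliar with the Task Arithmetic (TA) baseline the abstract evaluates on: TA (Ilharco et al., ICLR 2023, in the reference graph below) merges by adding scaled task vectors, each expert's weights minus the base weights, back onto the base model. A minimal sketch, assuming dict-of-arrays checkpoints (names hypothetical):

```python
import numpy as np

def task_arithmetic(base, experts, coeff=0.3):
    """Merge task experts by adding scaled task vectors to the base weights.

    `base` and each entry of `experts` map parameter names to arrays;
    `coeff` is the merging coefficient swept in Figure 5.
    """
    merged = {}
    for name, w0 in base.items():
        tau = sum(e[name] - w0 for e in experts)   # summed task vectors
        merged[name] = w0 + coeff * tau
    return merged
```

The performance gap this paper studies is between a model produced this way and the individual experts it was built from.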

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that feature drift in merged models can be decomposed into upstream propagation and local mismatch, which propagates through layers in forward order and links to output degradation. This motivates FeatCal, a post-merging calibration method that applies layer-wise closed-form weight updates in forward order on a small calibration set to reduce drift while remaining close to the merged weights, without gradients, optimization, or extra modules. It reports benchmark wins over Surgery and ProbSurgery: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE, plus efficiency results (82.9% with 8 examples/task; 53s for 256 examples/task).

Significance. If the decomposition holds and sequential calibration generalizes without reintroducing drift or harming unrelated capabilities, FeatCal would offer a practical, low-cost way to close the gap between merged models and task experts, strengthening model merging as an alternative to joint training or multi-model deployment.

major comments (1)
  1. [Method (drift decomposition and sequential update)] The central derivation assumes sequential forward-order closed-form calibration on local mismatch leaves residual upstream drift small after non-linearities (ReLU/GELU/attention) propagate changes in activation scale and distribution to later layers. No analytic bound, post-correction drift measurement, or ablation of the decomposition is supplied to verify that the full-pass residual remains negligible; this assumption is load-bearing for the claim that the method reduces total drift without re-solving the system.
minor comments (1)
  1. [Abstract and Experiments] Abstract and results sections report benchmark numbers and timing but omit any error bars, statistical significance tests, or details on how the 8-example and 256-example regimes were sampled.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the drift decomposition. We address the major comment below and will revise the manuscript to incorporate additional verification.

read point-by-point responses
  1. Referee: The central derivation assumes sequential forward-order closed-form calibration on local mismatch leaves residual upstream drift small after non-linearities (ReLU/GELU/attention) propagate changes in activation scale and distribution to later layers. No analytic bound, post-correction drift measurement, or ablation of the decomposition is supplied to verify that the full-pass residual remains negligible; this assumption is load-bearing for the claim that the method reduces total drift without re-solving the system.

    Authors: We agree that the manuscript does not supply an analytic bound on residual upstream drift after non-linearities, nor does it report explicit post-correction drift measurements across layers or an ablation isolating the sequential decomposition. The derivation proceeds from the forward-order propagation of local mismatch and relies on the empirical observation that layer-wise closed-form updates reduce total drift without iterative re-solving. In the revision we will add: (i) layer-wise feature drift measurements before and after FeatCal on the calibration set, (ii) an ablation comparing sequential forward calibration against a simultaneous (non-sequential) variant, and (iii) a brief discussion of why a tight analytic bound is intractable for general non-linear activations while the empirical gains on CLIP and GLUE benchmarks support the practical utility of the approach. revision: yes
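The sequential-versus-simultaneous distinction at issue here can be made concrete. The toy sketch below is not the paper's algorithm; the tanh layers, ridge form, and all names are assumptions. It calibrates in forward order, so each layer's closed-form solve sees the already-calibrated merged activations, including whatever residual upstream drift the non-linearity let through:

```python
import numpy as np

def forward(layers, X):
    """Run a stack of linear layers with tanh activations (toy model)."""
    h = X
    for W in layers:
        h = np.tanh(h @ W.T)
    return h

def calibrate_sequential(merged, expert, X, lam=1e-3):
    """Forward-order layer-wise calibration on a calibration set X.

    Each layer's weights are solved in closed form so that the calibrated
    merged activations reproduce the expert's pre-activations, with a ridge
    penalty keeping the solution near the merged weights.
    """
    h_m, h_e = X, X            # merged (calibrated so far) and expert activations
    calibrated = []
    for W_m, W_e in zip(merged, expert):
        z_e = h_e @ W_e.T                      # expert pre-activations (targets)
        d = h_m.shape[1]
        A = h_m.T @ h_m + lam * np.eye(d)
        B = z_e.T @ h_m + lam * W_m
        W = np.linalg.solve(A, B.T).T          # ridge-style closed form
        calibrated.append(W)
        h_m = np.tanh(h_m @ W.T)               # propagate calibrated activations
        h_e = np.tanh(z_e)
    return calibrated
```

A simultaneous variant would instead solve every layer against the uncalibrated merged activations; comparing the two is the ablation the rebuttal promises.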

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents a decomposition of feature drift into upstream propagation and local mismatch, then derives a layer-wise closed-form weight update using a small held-out calibration set. This process computes updates directly from data examples and does not reduce by construction to fitted parameters, self-citations, or tautological definitions. The central claims rest on empirical improvements over baselines on CLIP and GLUE tasks rather than on any load-bearing self-citation or ansatz smuggled from prior work. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the feature-drift decomposition and the premise that local layer calibration suffices to counteract propagated effects; no explicit free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Feature drift is the primary driver of the performance gap between merged and expert models
    The paper studies the gap exclusively through this lens and builds the calibration method upon it.
invented entities (1)
  • feature drift (decomposed into upstream propagation and local mismatch) no independent evidence
    purpose: To quantify and track internal representation differences that explain output degradation
    New conceptual framing introduced to motivate the calibration approach; no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.0 · 5597 in / 1401 out tokens · 44984 ms · 2026-05-14T20:23:54.310104+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

  1. Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022.

  2. Michael S. Matena and Colin A. Raffel. Merging models with Fisher-weighted averaging. In NeurIPS, 2022.

  3. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In ICLR, 2023.

  4. Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. In NeurIPS, 2023.

  5. Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. AdaMerging: Adaptive model merging for multi-task learning. In ICLR, 2024.

  6. Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are Super Mario: Absorbing abilities from homologous models as a free lunch. In ICML, 2024.

  7. MohammadReza Davari and Eugene Belilovsky. Model Breadcrumbs: Scaling multi-task model merging with sparse masks. In ECCV, 2024.

  8. Derek Tam, Mohit Bansal, and Colin Raffel. Merging by matching models in task parameter subspaces. TMLR, 2024.

  9. Nico Daheim, Thomas Möllenhoff, Edoardo M. Ponti, Iryna Gurevych, and Mohammad Emtiyaz Khan. Model merging by uncertainty-based gradient matching. In ICLR, 2024.

  10. Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. In ICML, 2025.

  11. Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging. In ICML, 2024.

  12. Qi Wei, Shuo He, Enneng Yang, Tingcong Liu, Haobo Wang, Lei Feng, and Bo An. Representation surgery in model merging with probabilistic modeling. In ICML, 2025.

  13. Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xingwei Wang, Xiaocun Cao, Jie Zhang, and Dacheng Tao. SurgeryV2: Bridging the gap between model merging and multi-task learning with deep representation surgery. arXiv preprint arXiv:2410.14389, 2024.

  14. Marcin Osial, Daniel Marczak, and Bartosz Zieliński. Parameter-efficient interventions for enhanced model merging. In Proceedings of the 2025 SIAM International Conference on Data Mining, 2025.

  15. Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. In NeurIPS, 2023.

  16. Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In ICLR, 2023.

  17. The-Hai Nguyen, Huu-Tien Dang, Takeshi Suzuki, and Le-Minh Nguyen. RegMean++: Enhancing effectiveness and generalization of regression mean for model merging. arXiv preprint arXiv:2508.03121, 2025.

  18. Wenju Sun, Qingyong Li, Wen Wang, Yang Liu, Yangliao Geng, and Boyang Li. Towards minimizing feature drift in model merging: Layer-wise task vector fusion for adaptive knowledge integration. In NeurIPS, 2025.

  19. Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970.

  20. A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. V. H. Winston & Sons.

  21. Distributed solely by Halsted Press.

  22. Anke Tang, Li Shen, Yong Luo, Enneng Yang, Han Hu, Lefei Zhang, Bo Du, and Dacheng Tao. FusionBench: A unified library and comprehensive benchmark for deep model fusion. JMLR, 2025.

  23. Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, and Han Zhao. MergeBench: A benchmark for merging domain-specialized LLMs. arXiv preprint arXiv:2505.10833, 2025.

  24. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

  25. Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

  26. Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshops, 2013.

  27. Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE, 2017.

  28. Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

  29. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

  30. Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German traffic sign recognition benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, 2011.

  31. Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.

  32. Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.

  33. Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

  34. Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In Medical Image Computing and Computer Assisted Intervention, 2018.

  35. Ian J. Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, Yingbo Zhou, Chetan Ramaiah, Fangxiang Feng, Ruifan Li, Xiaojie Wang, Dimitris Athanasakis, John Shawe-Taylor, Maxim Milakov, John Park, Radu Ionescu, Marius Popescu, Cristian Grozea, James Bergstra, Ji...

  36. Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, 2012.

  37. Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.

  38. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

  39. Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: Mining discriminative components with random forests. In ECCV, 2014.

  40. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

  41. Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre van Schaik. EMNIST: Extending MNIST to handwritten letters. In International Joint Conference on Neural Networks, 2017.

  42. Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.

  43. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.

  44. OpenAI. Rendered SST-2 Dataset. https://github.com/openai/CLIP/blob/main/data/rendered-sst2.md, 2021.

  45. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

  46. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In ICLR, 2022.

  47. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H...

  48. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.

  49. Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. TACL, 2019.

  50. Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.

  51. William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, 2005.

  52. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

  53. Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006.

  54. Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In SemEval, 2017.

  55. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

  56. Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Zhen Li, Chi Yung Chung, and Hongxia Yang. Model fusion for scalable and sustainable artificial intelligence: A review and outlook. Journal of Modern Power Systems and Clean Energy, 2026.

  57. Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhijie Sang, Zhaoyi Yan, Zhen Li, Shengyu Zhang, Fei Wu, and Hongxia Yang. Democratizing AI through model fusion: A comprehensive review and future directions. Nexus, 2025.

  58. Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, and Hongxia Yang. Model merging scaling laws in large language models. arXiv preprint arXiv:2509.24244, 2025.

  59. Yuanyi Wang, Yanggan Gu, Zihao Wang, Kunxi Li, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, and Hongxia Yang. MergePipe: A budget-aware parameter management system for scalable LLM merging. arXiv preprint arXiv:2602.13273, 2026.

  60. Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, and Hongxia Yang. InfiFPO: Implicit model fusion via preference optimization in large language models. arXiv preprint arXiv:2505.13878, 2025.

  61. Yanggan Gu, Junzhuo Li, Sirui Huang, Xin Zou, Zhenghua Li, and Xuming Hu. Capturing nuanced preferences: Preference-aligned distillation for small language models. In Findings of ACL, 2025.

  62. Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, and Hongxia Yang. InfiGFusion: Graph-on-logits distillation via efficient Gromov-Wasserstein for model fusion. arXiv preprint arXiv:2505.13893, 2025.

  63. Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, and Xuming Hu. Exploring response uncertainty in MLLMs: An empirical evaluation under misleading scenarios. In EMNLP, 2025.

  64. Yifan Yang, Jinjia Li, Kunxi Li, Puhao Zheng, Yuanyi Wang, Zheyan Qu, Yang Yu, Jianmin Wu, Ming Li, and Hongxia Yang. InfiCoEvalChain: A blockchain-based decentralized framework for collaborative LLM evaluation. arXiv preprint arXiv:2602.08229, 2026.

  65. Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, and Hongxia Yang. InfiR2: A comprehensive FP8 training recipe for reasoning-enhanced language models. arXiv preprint arXiv:2509.22536, 2025.