pith. machine review for the scientific record.

arxiv: 2605.13030 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


FeatCal: Feature Calibration for Post-Merging Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords model merging · feature drift · calibration · task arithmetic · CLIP · GLUE · post-merging

The pith

Feature drift in merged models decomposes into upstream propagation and local mismatch that can be corrected layer by layer using closed-form calibration on a small set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that performance drops after merging task experts arise from feature drift between the merged model and each expert on the same inputs. This drift is tracked through layers in forward order as the sum of changes propagated from earlier layers and fresh mismatches at the current layer, directly tying it to degraded outputs. The view leads to FeatCal, which solves for small weight adjustments at each layer using a closed-form expression on a tiny calibration set, cutting drift while staying near the original merged weights and keeping the speed and storage benefits of merging. A sympathetic reader cares because the method recovers most expert-level accuracy on vision and language benchmarks without gradients, extra parameters, or joint retraining.

Core claim

Feature drift between merged and expert models can be decomposed into upstream propagation and local mismatch; tracking this drift through layers in forward order links it to output degradation and motivates an efficient closed-form layer-wise calibration that reduces drift while remaining close to the merged weights.

What carries the argument

The decomposition of feature drift into upstream propagation and local mismatch, which enables derivation of layer-wise closed-form weight calibration updates.
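One way to see why such a decomposition exists at all: it is an exact add-and-subtract identity. The sketch below uses assumed notation, not the paper's own symbols. Writing $h_\ell = f_\ell(h_{\ell-1})$ for the features after layer $\ell$, with superscripts $m$ for the merged model and $e$ for the expert:

```latex
% Exact add-and-subtract identity behind the drift decomposition
% (notation assumed: h_l = f_l(h_{l-1}); m = merged model, e = expert)
h_\ell^{m} - h_\ell^{e}
  = \underbrace{f_\ell^{m}\!\big(h_{\ell-1}^{m}\big) - f_\ell^{m}\!\big(h_{\ell-1}^{e}\big)}_{\text{upstream propagation}}
  + \underbrace{f_\ell^{m}\!\big(h_{\ell-1}^{e}\big) - f_\ell^{e}\!\big(h_{\ell-1}^{e}\big)}_{\text{local mismatch}}
```

The first term is driven entirely by drift already present at layer $\ell-1$; the second vanishes whenever the merged layer matches the expert layer, which is the part a layer-wise calibration can attack directly.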

If this is right

  • FeatCal reaches 85.5% accuracy on CLIP-ViT-B/32 Task Arithmetic versus 77.0% and 78.8% for Surgery and ProbSurgery.
  • On FLAN-T5-base GLUE it reaches 85.2% versus 83.7% and 82.2%.
  • Eight examples per task yield 82.9% on CLIP-ViT-B/32 while 256 examples finish in 53 seconds, roughly four times faster than the baselines.
  • No gradient descent, iterative optimization, or added modules are required.
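The paper's exact update rule is not quoted in this review, but a "closed-form, no gradient descent" layer calibration is naturally a ridge-style least-squares solve (the paper's reference list includes Hoerl–Kennard ridge regression and Tikhonov regularization). A minimal sketch under that assumption, with hypothetical names; this is an illustration of the form of such an update, not the paper's algorithm:

```python
import numpy as np

def calibrate_layer(W_merged, X_merged, F_expert, lam=1.0):
    """Closed-form ridge-style update for one linear layer (illustrative).

    Solves   min_W ||X_merged @ W.T - F_expert||^2 + lam * ||W - W_merged||^2,
    whose stationary point is  W = (F^T X + lam * W_m) (X^T X + lam * I)^{-1}.
    The penalty keeps the calibrated weights near the merged weights.
    """
    d = X_merged.shape[1]
    A = X_merged.T @ X_merged + lam * np.eye(d)        # (d, d) Gram matrix
    B = F_expert.T @ X_merged + lam * W_merged         # (d_out, d) target side
    return np.linalg.solve(A, B.T).T                   # calibrated weights
```

With a large `lam` the solution stays at the merged weights; with a small `lam` it fits the expert features on the calibration set as closely as the data allow.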

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same forward-order calibration could be applied to merging methods other than task arithmetic.
  • Feature-drift tracking may allow selective recalibration of only the layers where drift accumulates most.
  • The closed-form solution could support incremental merging by updating only new layers when additional tasks are added.

Load-bearing premise

The decomposition of feature drift into upstream propagation and local mismatch captures the dominant cause of output degradation, and the layer-wise closed-form calibration on a small set generalizes without harming unrelated capabilities.

What would settle it

Applying the FeatCal updates fails to reduce measured feature drift or raise accuracy above the uncalibrated merged model and the Surgery baselines on a held-out calibration set from the same tasks.
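The drift measurements such a test would rest on are the two diagnostics the figures report: mean L2 feature drift and mean cosine similarity between merged and expert features. A minimal sketch (function name hypothetical):

```python
import numpy as np

def feature_drift(F_merged, F_expert):
    """Mean L2 drift and mean cosine similarity between two feature
    matrices (rows = examples, columns = feature dimensions)."""
    diff = F_merged - F_expert
    mean_l2 = np.linalg.norm(diff, axis=1).mean()
    cos = np.sum(F_merged * F_expert, axis=1) / (
        np.linalg.norm(F_merged, axis=1) * np.linalg.norm(F_expert, axis=1))
    return mean_l2, cos.mean()
```

Comparing these numbers before and after applying the updates, on a held-out split, is exactly the settling experiment described above.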

Figures

Figures reproduced from arXiv: 2605.13030 by Hongxia Yang, Jianmin Wu, Pengkai Wang, Shuo Cai, Sirui Huang, Su Lu, Wenjun Wang, Yanggan Gu, Yuanyi Wang, Zihao Wang.

Figure 1
Figure 1: Feature drift after Task Arithmetic (TA) merging and FEATCAL calibration in CLIP-ViT-B/32. Panels (a,b) use Stanford Cars: FEATCAL moves features toward expert features, raises their mean cosine similarity from 0.60 to 0.84, and reduces mean L2 feature drift, with 46% less final-layer drift. Panel (c) reports per-task accuracy in the 8-task setting. App. B gives full 8-task feature views. view at source ↗
Figure 2
Figure 2: Feature-calibration diagnostics for TA on CLIP-ViT-B/32. (a) Task-wise final-feature … view at source ↗
Figure 3
Figure 3: Sample efficiency and calibration cost. view at source ↗
Figure 4
Figure 4: Calibration examples under corruptions. Robustness to corrupted calibration data. We corrupt only the images used for post-merging calibration and evaluate on clean test sets from 8-task TA, isolating calibration data quality rather than test-time corruption robustness. Protocol details are in App. L. FEATCAL remains strongest in every reported setting, with a 78.0% average over clean, Gaussian noise, mot… view at source ↗
Figure 5
Figure 5: CLIP-ViT-B/32 TA coefficient sweeps. Ablation study. view at source ↗
Figure 6
Figure 6: Extends the Stanford Cars visualization in panel (a) of … view at source ↗
Figure 7
Figure 7: Feature-calibration diagnostics for TA w/ … view at source ↗
read the original abstract

Model merging combines task experts into one model and avoids joint training, retraining, or deploying many expert models, but the merged model often still underperforms task experts. We study this performance gap through feature drift, the difference between features produced by the merged model and by the expert on the same input. Our theory decomposes this drift into upstream propagation and local mismatch, tracks how it propagates and combines through later layers in forward order, and links final feature drift to output drift. This view motivates FeatCal, which uses a small calibration set to calibrate the merged model weights layer by layer in forward order, reducing feature drift while staying close to merged weights and preserving the benefits of model merging. FeatCal uses an efficient closed-form solution to update model weights, with no gradient descent, iterative optimization, or extra modules. On the main CLIP and GLUE benchmarks, FeatCal beats Surgery and ProbSurgery, the closest post-merging calibration baselines: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic (TA) and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE. On CLIP-ViT-B/32, 8 examples per task reach 82.9%, and 256 examples per task take 53 seconds, about 4x faster than both baselines, showing better sample efficiency and lower calibration cost.
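For readers unfamiliar with the Task Arithmetic (TA) baseline the abstract evaluates on: TA (Ilharco et al., ICLR 2023, in the reference graph below) merges by adding scaled task vectors, each expert's weights minus the base weights, back onto the base model. A minimal sketch, assuming dict-of-arrays checkpoints (names hypothetical):

```python
import numpy as np

def task_arithmetic(base, experts, coeff=0.3):
    """Merge task experts by adding scaled task vectors to the base weights.

    `base` and each entry of `experts` map parameter names to arrays;
    `coeff` is the merging coefficient swept in Figure 5.
    """
    merged = {}
    for name, w0 in base.items():
        tau = sum(e[name] - w0 for e in experts)   # summed task vectors
        merged[name] = w0 + coeff * tau
    return merged
```

The performance gap this paper studies is between a model produced this way and the individual experts it was built from.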

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that feature drift in merged models can be decomposed into upstream propagation and local mismatch, which propagates through layers in forward order and links to output degradation. This motivates FeatCal, a post-merging calibration method that applies layer-wise closed-form weight updates in forward order on a small calibration set to reduce drift while remaining close to the merged weights, without gradients, optimization, or extra modules. It reports benchmark wins over Surgery and ProbSurgery: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE, plus efficiency results (82.9% with 8 examples/task; 53s for 256 examples/task).

Significance. If the decomposition holds and sequential calibration generalizes without reintroducing drift or harming unrelated capabilities, FeatCal would offer a practical, low-cost way to close the gap between merged models and task experts, strengthening model merging as an alternative to joint training or multi-model deployment.

major comments (1)
  1. [Method (drift decomposition and sequential update)] The central derivation assumes sequential forward-order closed-form calibration on local mismatch leaves residual upstream drift small after non-linearities (ReLU/GELU/attention) propagate changes in activation scale and distribution to later layers. No analytic bound, post-correction drift measurement, or ablation of the decomposition is supplied to verify that the full-pass residual remains negligible; this assumption is load-bearing for the claim that the method reduces total drift without re-solving the system.
minor comments (1)
  1. [Abstract and Experiments] Abstract and results sections report benchmark numbers and timing but omit any error bars, statistical significance tests, or details on how the 8-example and 256-example regimes were sampled.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the drift decomposition. We address the major comment below and will revise the manuscript to incorporate additional verification.

read point-by-point responses
  1. Referee: The central derivation assumes sequential forward-order closed-form calibration on local mismatch leaves residual upstream drift small after non-linearities (ReLU/GELU/attention) propagate changes in activation scale and distribution to later layers. No analytic bound, post-correction drift measurement, or ablation of the decomposition is supplied to verify that the full-pass residual remains negligible; this assumption is load-bearing for the claim that the method reduces total drift without re-solving the system.

    Authors: We agree that the manuscript does not supply an analytic bound on residual upstream drift after non-linearities, nor does it report explicit post-correction drift measurements across layers or an ablation isolating the sequential decomposition. The derivation proceeds from the forward-order propagation of local mismatch and relies on the empirical observation that layer-wise closed-form updates reduce total drift without iterative re-solving. In the revision we will add: (i) layer-wise feature drift measurements before and after FeatCal on the calibration set, (ii) an ablation comparing sequential forward calibration against a simultaneous (non-sequential) variant, and (iii) a brief discussion of why a tight analytic bound is intractable for general non-linear activations while the empirical gains on CLIP and GLUE benchmarks support the practical utility of the approach. revision: yes
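The sequential-versus-simultaneous distinction at issue here can be made concrete. The toy sketch below is not the paper's algorithm; the tanh layers, ridge form, and all names are assumptions. It calibrates in forward order, so each layer's closed-form solve sees the already-calibrated merged activations, including whatever residual upstream drift the non-linearity let through:

```python
import numpy as np

def forward(layers, X):
    """Run a stack of linear layers with tanh activations (toy model)."""
    h = X
    for W in layers:
        h = np.tanh(h @ W.T)
    return h

def calibrate_sequential(merged, expert, X, lam=1e-3):
    """Forward-order layer-wise calibration on a calibration set X.

    Each layer's weights are solved in closed form so that the calibrated
    merged activations reproduce the expert's pre-activations, with a ridge
    penalty keeping the solution near the merged weights.
    """
    h_m, h_e = X, X            # merged (calibrated so far) and expert activations
    calibrated = []
    for W_m, W_e in zip(merged, expert):
        z_e = h_e @ W_e.T                      # expert pre-activations (targets)
        d = h_m.shape[1]
        A = h_m.T @ h_m + lam * np.eye(d)
        B = z_e.T @ h_m + lam * W_m
        W = np.linalg.solve(A, B.T).T          # ridge-style closed form
        calibrated.append(W)
        h_m = np.tanh(h_m @ W.T)               # propagate calibrated activations
        h_e = np.tanh(z_e)
    return calibrated
```

A simultaneous variant would instead solve every layer against the uncalibrated merged activations; comparing the two is the ablation the rebuttal promises.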

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents a decomposition of feature drift into upstream propagation and local mismatch, then derives a layer-wise closed-form weight update using a small held-out calibration set. This process computes updates directly from data examples and does not reduce by construction to fitted parameters, self-citations, or tautological definitions. The central claims rest on empirical improvements over baselines on CLIP and GLUE tasks rather than on any load-bearing self-citation or ansatz smuggled from prior work. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the feature-drift decomposition and the premise that local layer calibration suffices to counteract propagated effects; no explicit free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Feature drift is the primary driver of the performance gap between merged and expert models
    The paper studies the gap exclusively through this lens and builds the calibration method upon it.
invented entities (1)
  • feature drift (decomposed into upstream propagation and local mismatch) no independent evidence
    purpose: To quantify and track internal representation differences that explain output degradation
    New conceptual framing introduced to motivate the calibration approach; no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.0 · 5597 in / 1401 out tokens · 44984 ms · 2026-05-14T20:23:54.310104+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

  1. Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022.

  2. Michael S. Matena and Colin A. Raffel. Merging models with Fisher-weighted averaging. In NeurIPS, 2022.

  3. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In ICLR, 2023.

  4. Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. In NeurIPS, 2023.

  5. Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. AdaMerging: Adaptive model merging for multi-task learning. In ICLR, 2024.

  6. Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are Super Mario: Absorbing abilities from homologous models as a free lunch. In ICML, 2024.

  7. MohammadReza Davari and Eugene Belilovsky. Model Breadcrumbs: Scaling multi-task model merging with sparse masks. In ECCV, 2024.

  8. Derek Tam, Mohit Bansal, and Colin Raffel. Merging by matching models in task parameter subspaces. TMLR, 2024.

  9. Nico Daheim, Thomas Möllenhoff, Edoardo M. Ponti, Iryna Gurevych, and Mohammad Emtiyaz Khan. Model merging by uncertainty-based gradient matching. In ICLR, 2024.

  10. Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. In ICML, 2025.

  11. Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging. In ICML, 2024.

  12. Qi Wei, Shuo He, Enneng Yang, Tingcong Liu, Haobo Wang, Lei Feng, and Bo An. Representation surgery in model merging with probabilistic modeling. In ICML, 2025.

  13. Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xingwei Wang, Xiaocun Cao, Jie Zhang, and Dacheng Tao. SurgeryV2: Bridging the gap between model merging and multi-task learning with deep representation surgery. arXiv preprint arXiv:2410.14389, 2024.

  14. Marcin Osial, Daniel Marczak, and Bartosz Zieliński. Parameter-efficient interventions for enhanced model merging. In Proceedings of the 2025 SIAM International Conference on Data Mining, 2025.

  15. Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. In NeurIPS, 2023.

  16. Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In ICLR, 2023.

  17. The-Hai Nguyen, Huu-Tien Dang, Takeshi Suzuki, and Le-Minh Nguyen. RegMean++: Enhancing effectiveness and generalization of regression mean for model merging. arXiv preprint arXiv:2508.03121, 2025.

  18. Wenju Sun, Qingyong Li, Wen Wang, Yang Liu, Yangliao Geng, and Boyang Li. Towards minimizing feature drift in model merging: Layer-wise task vector fusion for adaptive knowledge integration. In NeurIPS, 2025.

  19. Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970.

  20. A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. V. H. Winston & Sons.

  21. Distributed solely by Halsted Press.

  22. Anke Tang, Li Shen, Yong Luo, Enneng Yang, Han Hu, Lefei Zhang, Bo Du, and Dacheng Tao. FusionBench: A unified library and comprehensive benchmark for deep model fusion. JMLR, 2025.

  23. Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, and Han Zhao. MergeBench: A benchmark for merging domain-specialized LLMs. arXiv preprint arXiv:2505.10833, 2025.

  24. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

  25. Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

  26. Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshops, 2013.

  27. Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE, 2017.

  28. Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

  29. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

  30. Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German traffic sign recognition benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, 2011.

  31. Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.

  32. Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.

  33. Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

  34. Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In Medical Image Computing and Computer Assisted Intervention, 2018.

  35. Ian J. Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, Yingbo Zhou, Chetan Ramaiah, Fangxiang Feng, Ruifan Li, Xiaojie Wang, Dimitris Athanasakis, John Shawe-Taylor, Maxim Milakov, John Park, Radu Ionescu, Marius Popescu, Cristian Grozea, James Bergstra, Ji...

  36. Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, 2012.

  37. Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.

  38. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

  39. Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: Mining discriminative components with random forests. In ECCV, 2014.

  40. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

  41. Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre van Schaik. EMNIST: Extending MNIST to handwritten letters. In International Joint Conference on Neural Networks, 2017.

  42. Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.

  43. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.

  44. OpenAI. Rendered SST-2 Dataset. https://github.com/openai/CLIP/blob/main/data/rendered-sst2.md, 2021.

  45. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

  46. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In ICLR, 2022.

  47. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H...

  48. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.

  49. Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. TACL, 2019.

  50. Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.

  51. William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, 2005.

  52. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

  53. Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006.

  54. Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In SemEval, 2017.

  55. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

  56. Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Zhen Li, Chi Yung Chung, and Hongxia Yang. Model fusion for scalable and sustainable artificial intelligence: A review and outlook. Journal of Modern Power Systems and Clean Energy, 2026.

  57. Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhijie Sang, Zhaoyi Yan, Zhen Li, Shengyu Zhang, Fei Wu, and Hongxia Yang. Democratizing AI through model fusion: A comprehensive review and future directions. Nexus, 2025.

  58. Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, and Hongxia Yang. Model merging scaling laws in large language models. arXiv preprint arXiv:2509.24244, 2025.

  59. Yuanyi Wang, Yanggan Gu, Zihao Wang, Kunxi Li, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, and Hongxia Yang. MergePipe: A budget-aware parameter management system for scalable LLM merging. arXiv preprint arXiv:2602.13273, 2026.

  60. Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, and Hongxia Yang. InfiFPO: Implicit model fusion via preference optimization in large language models. arXiv preprint arXiv:2505.13878, 2025.

  61. Yanggan Gu, Junzhuo Li, Sirui Huang, Xin Zou, Zhenghua Li, and Xuming Hu. Capturing nuanced preferences: Preference-aligned distillation for small language models. In Findings of ACL, 2025.

  62. Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, and Hongxia Yang. InfiGFusion: Graph-on-logits distillation via efficient Gromov-Wasserstein for model fusion. arXiv preprint arXiv:2505.13893, 2025.

  63. Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, and Xuming Hu. Exploring response uncertainty in MLLMs: An empirical evaluation under misleading scenarios. In EMNLP, 2025.

  64. Yifan Yang, Jinjia Li, Kunxi Li, Puhao Zheng, Yuanyi Wang, Zheyan Qu, Yang Yu, Jianmin Wu, Ming Li, and Hongxia Yang. InfiCoEvalChain: A blockchain-based decentralized framework for collaborative LLM evaluation. arXiv preprint arXiv:2602.08229, 2026.

  65. Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, and Hongxia Yang. InfiR2: A comprehensive FP8 training recipe for reasoning-enhanced language models. arXiv preprint arXiv:2509.22536, 2025.