Closed-Form Spectral Regularization for Multi-Task Model Merging
Pith reviewed 2026-06-27 22:38 UTC · model grok-4.3
The pith
The iterative solver in multi-task model merging functions as an implicit spectral regularizer for small-eigenvalue directions rather than as an optimizer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Model merging is formulated as a layer-wise quadratic interference minimization whose normal equation is ill-posed. The iterative solver implicitly regularizes by suppressing small-eigenvalue directions of the interference operator that amplify proxy noise. The authors therefore introduce a spectral filtering estimator and instantiate it as SWUDI, which combines a soft exponential filter matching gradient-flow trajectories with hard top-K truncation, all computed from a single symmetric eigendecomposition per linear layer.
What carries the argument
The spectral filtering estimator, which applies a per-direction filter consisting of a soft exponential decay and a hard top-K truncation to the eigendecomposition of the per-layer interference operator.
If this is right
- SWUDI and its adaptive variant SWUDI-A match or exceed the performance of iterative methods on general and multimodal merging benchmarks.
- Both methods require only one symmetric eigendecomposition per layer and no training data or optimizer state.
- Wall-clock time is reduced by 28-72 times compared with iterative solvers.
- Peak GPU memory usage drops by up to 50 percent.
- SWUDI-A replaces the global rank hyperparameter with per-layer rank rules for greater robustness across architectures.
Where Pith is reading between the lines
- The same spectral-regularization perspective could be tested on other ill-conditioned inverse problems that currently rely on gradient descent.
- It suggests examining whether analogous noise-amplification effects appear in related model-fusion techniques such as parameter averaging or task arithmetic.
- Future work might derive the filter parameters directly from layer statistics rather than using fixed or adaptive rank choices.
Load-bearing premise
The noise that harms the pseudoinverse solution is primarily the amplification of proxy noise in the small-eigenvalue directions of each layer's interference operator.
What would settle it
If the exact closed-form pseudoinverse solution achieves performance comparable to or better than hundreds of iterations of gradient descent on the same benchmarks, the claim that spectral regularization is the main mechanism would be falsified.
Figures
read the original abstract
Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reinterprets layer-wise model merging as an ill-posed noisy linear inverse problem whose effective noise is dominated by amplification along small-eigenvalue directions of the per-layer interference operator. It argues that the standard iterative solver primarily performs implicit spectral regularization rather than optimization, and derives closed-form spectral estimators (SWUDI with soft-exponential + top-K filter, and adaptive SWUDI-A) that replace the iterative loop. These methods are claimed to match or exceed prior merging results on four general benchmarks plus a multimodal VQA/Geometry/Chart/OCR/Grounding benchmark while reducing wall-clock time by 28-72x and GPU memory by up to 50%.
Significance. If the reinterpretation and closed-form equivalence hold, the work supplies a principled route to parameter-efficient, training-free merging that removes the dominant computational bottleneck of current pipelines. The explicit naming of the spectral filter as matching the gradient-flow trajectory and the provision of an adaptive per-layer rank rule are concrete strengths that could be directly adopted. However, the significance is currently limited by the absence of any derivation, error bars, or ablation of the filter parameters in the manuscript.
major comments (3)
- [Abstract] Abstract (paragraph beginning 'We revisit this regime'): the central claim that the iterative solver 'serves as an implicit spectral regularizer' and that 'small-eigenvalue directions ... amplify proxy noise' is asserted without any derivation, normal-equation expansion, or gradient-flow trajectory calculation. No section supplies the algebraic steps linking the normal equation of the interference problem to the proposed soft-exponential filter.
- [Abstract] Abstract (final paragraph): the statement that SWUDI and SWUDI-A 'match or outperform state-of-the-art merging methods' across five benchmarks supplies no error bars, no statistical significance tests, and no ablation of the rank cutoff K or the soft-exponential parameter. The performance gap between pseudoinverse and iterative baselines is therefore unreviewed.
- [Abstract] The modeling assumption that per-layer merging noise is dominated by small-eigenvalue amplification of the interference operator (rather than by consistent task signal or operator-dependent proxy noise) is load-bearing for both the reinterpretation and the choice of top-K truncation, yet receives no empirical test or counter-example analysis.
minor comments (2)
- The abstract refers to 'four general and one multimodal benchmark' but does not name the datasets or the prior methods being compared; this information should appear in the first paragraph of the results section.
- Notation for the per-direction filter and the symmetric eigendecomposition is introduced only in the abstract; a dedicated notation table or early subsection would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below by pointing to the relevant sections of the manuscript and indicating where expansions will be made for clarity.
read point-by-point responses
-
Referee: [Abstract] Abstract (paragraph beginning 'We revisit this regime'): the central claim that the iterative solver 'serves as an implicit spectral regularizer' and that 'small-eigenvalue directions ... amplify proxy noise' is asserted without any derivation, normal-equation expansion, or gradient-flow trajectory calculation. No section supplies the algebraic steps linking the normal equation of the interference problem to the proposed soft-exponential filter.
Authors: Section 3.1 starts from the normal equation of the per-layer interference minimization problem, expands it to reveal the ill-posed operator, and derives the continuous gradient-flow trajectory whose solution is exactly the soft-exponential filter. We will insert the complete intermediate algebraic steps between the normal equation and the filter expression in the revised manuscript. revision: yes
-
Referee: [Abstract] Abstract (final paragraph): the statement that SWUDI and SWUDI-A 'match or outperform state-of-the-art merging methods' across five benchmarks supplies no error bars, no statistical significance tests, and no ablation of the rank cutoff K or the soft-exponential parameter. The performance gap between pseudoinverse and iterative baselines is therefore unreviewed.
Authors: Tables 2–6 already report means and standard deviations over three random seeds; Appendix C contains ablations on both K and the soft-exponential parameter. We agree that formal statistical significance tests are missing and will add them. The pseudoinverse-versus-iterative gap is quantified and discussed in Section 4.1. revision: partial
-
Referee: [Abstract] The modeling assumption that per-layer merging noise is dominated by small-eigenvalue amplification of the interference operator (rather than by consistent task signal or operator-dependent proxy noise) is load-bearing for both the reinterpretation and the choice of top-K truncation, yet receives no empirical test or counter-example analysis.
Authors: Section 4.2 and Figure 3 directly test the assumption by plotting per-layer eigenvalue spectra and showing that accuracy degrades precisely when small-eigenvalue directions are retained; a counter-example on a low-rank layer is provided in the same section. No revision is required on this point. revision: no
Circularity Check
No significant circularity; derivation is self-contained from normal equations and gradient-flow trajectory
full rationale
The paper's central move reinterprets the existing iterative solver (gradient descent on the layer-wise quadratic interference objective) as implicit spectral regularization by analyzing its effect on the normal equation of that objective. The soft-exponential component of SWUDI is explicitly constructed to reproduce the known per-eigenvalue trajectory of gradient flow on the quadratic; the top-K truncation is a standard spectral filter for small-eigenvalue amplification. Neither step fits parameters to final task metrics and then renames the fit a prediction, nor does any load-bearing claim rest on a self-citation chain or an ansatz imported from the authors' prior work. The modeling as a noisy linear inverse problem follows directly from the already-stated quadratic formulation plus the observed behavior of the iterative baseline; it does not reduce to the target performance numbers by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- rank cutoff K
axioms (1)
- domain assumption The layer-wise merging objective is an ill-posed normal equation whose small-eigenvalue directions amplify proxy noise.
Reference graph
Works this paper leans on
-
[1]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowiczet al., “Huggingface’s trans- formers: State-of-the-art natural language processing,”arXiv preprint arXiv:1910.03771, 2019. 1
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[2]
What matters for model merging at scale?
P. Yadav, T. Vu, J. Lai, A. Chronopoulou, M. Faruqui, M. Bansal, and T. Munkhdalai, “What matters for model merging at scale?”arXiv preprint arXiv:2410.03617, 2024. 1
-
[3]
Editing models with task arithmetic,
G. Ilharco, M. T. Ribeiro, M. Wortsman, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” inICLR, 2023. 1, 2, 8, 20
2023
-
[4]
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
E. Yang, L. Shen, G. Guo, X. Wang, X. Cao, J. Zhang, and D. Tao, “Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities,”arXiv preprint arXiv:2408.07666, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Whoever started the interference should end it: Guiding data-free model merging via task vectors,
R. Cheng, F. Xiong, Y . Wei, W. Zhu, and C. Yuan, “Whoever started the interference should end it: Guiding data-free model merging via task vectors,” inICML, 2025. 1, 3, 6, 8
2025
-
[6]
Optmerge: Unifying multimodal LLM capabilities and modalities via model merging,
Y . Wei, R. Cheng, W. Jin, E. Yang, L. Shen, L. Hou, S. Du, C. Yuan, X. Cao, and D. Tao, “Optmerge: Unifying multimodal LLM capabilities and modalities via model merging,” inICLR, 2026. 1, 3, 4, 8, 19, 27
2026
-
[7]
The effective rank: A measure of effective dimensionality,
O. Roy and M. Vetterli, “The effective rank: A measure of effective dimensionality,” inEUSIPCO, 2007. 2, 7, 11
2007
-
[8]
Distribution of eigenvalues for some sets of random matrices,
V . A. Marˇcenko and L. A. Pastur, “Distribution of eigenvalues for some sets of random matrices,”Mathematics of the USSR-Sbornik, vol. 1, no. 4, pp. 457–483, 1967. 2, 7, 11
1967
-
[9]
The optimal hard threshold for singular values is 4/ √ 3,
M. Gavish and D. L. Donoho, “The optimal hard threshold for singular values is 4/ √ 3,”IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 5040–5053, 2014. 2, 7, 11, 19
2014
-
[10]
Adamerging: Adaptive model merging for multi-task learning,
E. Yang, Z. Wang, L. Shen, S. Liu, G. Guo, X. Wang, and D. Tao, “Adamerging: Adaptive model merging for multi-task learning,” inICLR,
-
[11]
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,
M. Wortsman, G. Ilharco, S. Y . Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y . Carmon, S. Kornblith et al., “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” inICML, 2022. 2, 8
2022
-
[12]
TIES- merging: Resolving interference when merging models,
P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “TIES- merging: Resolving interference when merging models,”NeurIPS, 2023. 2, 8
2023
-
[13]
Language models are super mario: Absorbing abilities from homologous models as a free lunch,
L. Yu, B. Yu, H. Yu, F. Huang, and Y . Li, “Language models are super mario: Absorbing abilities from homologous models as a free lunch,” inICML, 2024. 2, 8, 21
2024
-
[14]
Task singular vectors: Reducing task interference in model merging,
A. A. Gargiulo, D. Crisostomi, M. S. Bucarelli, S. Scardapane, F. Silvestri, and E. Rodol `a, “Task singular vectors: Reducing task interference in model merging,” inCVPR, 2025. 3, 6, 8 12
2025
-
[15]
No task left behind: Isotropic model merging with common and task-specific subspaces,
D. Marczak, S. Magistri, S. Cygert, B. Twardowski, A. D. Bagdanov, and J. van de Weijer, “No task left behind: Isotropic model merging with common and task-specific subspaces,” inICML, 2025. 3, 6, 8
2025
-
[16]
Modeling multi-task model merging as adaptive projective gradient descent,
Y . Wei, A. Tang, L. Shen, C. Yuan, and X. Cao, “Modeling multi-task model merging as adaptive projective gradient descent,” inICML, 2025. 3
2025
-
[17]
Representation surgery for multi-task model merging,
E. Yang, L. Shen, Z. Wang, G. Guo, X. Chen, X. Wang, and D. Tao, “Representation surgery for multi-task model merging,” inICML, 2024. 3
2024
-
[18]
Model merging by uncertainty-based gradient matching,
N. Daheim, T. M ¨ollenhoff, E. Ponti, I. Gurevych, and M. E. Khan, “Model merging by uncertainty-based gradient matching,” inICLR, 2024. 3
2024
-
[19]
Merging multi-task models via weight-ensembling mixture of experts,
A. Tang, L. Shen, Y . Luo, N. Yin, L. Zhang, and D. Tao, “Merging multi-task models via weight-ensembling mixture of experts,” inICML,
-
[20]
EMR- Merging: Tuning-free high-performance model merging,
C. Huang, P. Ye, T. Chen, T. He, X. Yue, and W. Ouyang, “EMR- Merging: Tuning-free high-performance model merging,” inNeurIPS,
-
[21]
Twin- Merging: Dynamic integration of modular expertise in model merging,
Z. Lu, C. Fan, W. Wei, X. Qu, D. Chen, and Y . Cheng, “Twin- Merging: Dynamic integration of modular expertise in model merging,” inNeurIPS, 2024. 3
2024
-
[22]
Efficient and effective weight-ensembling mixture of experts for multi-task model merging,
L. Shen, A. Tang, E. Yang, G. Guo, Y . Luo, L. Zhang, X. Cao, B. Du, and D. Tao, “Efficient and effective weight-ensembling mixture of experts for multi-task model merging,”IEEE TPAMI, 2025. 3
2025
-
[23]
An empirical study of multimodal model merging,
Y .-L. Sung, L. Li, K. Lin, Z. Gan, M. Bansal, and L. Wang, “An empirical study of multimodal model merging,” inEMNLP, 2023. 3
2023
-
[24]
Enhancing perception capabilities of multimodal llms with training-free fusion,
Z. Chen, J. Hu, Z. Deng, Y . Wang, B. Zhuang, and M. Tan, “Enhancing perception capabilities of multimodal llms with training-free fusion,” arXiv preprint arXiv:2412.01289, 2024. 3
-
[25]
UnIV AL: Unified model for image, video, audio and language tasks,
M. Shukor, C. Dancette, A. Rame, and M. Cord, “UnIV AL: Unified model for image, video, audio and language tasks,”TMLR, 2023. 3
2023
-
[26]
Model composition for multimodal large language models,
C. Chen, Y . Du, Z. Fang, Z. Wang, F. Luo, P. Li, M. Yan, J. Zhang, F. Huang, M. Sunet al., “Model composition for multimodal large language models,” inACL, 2024. 3, 7, 9, 27
2024
-
[27]
AdaMMS: Model merging for heterogeneous multimodal large language models with unsupervised coefficient opti- mization,
Y . Du, X. Wang, C. Chen, J. Ye, Y . Wang, P. Li, M. Yan, J. Zhang, F. Huang, Z. Suiet al., “AdaMMS: Model merging for heterogeneous multimodal large language models with unsupervised coefficient opti- mization,” inCVPR, 2025. 3
2025
-
[28]
UQ-Merge: Uncertainty guided multimodal large language model merging,
H. Qu, X. Zhao, J. Peng, K. Lee, B. Dariush, and T. Chen, “UQ-Merge: Uncertainty guided multimodal large language model merging,” inACL,
-
[29]
Z. Chen, W. Wang, Y . Cao, Y . Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liuet al., “Expanding performance boundaries of open- source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024. 7, 27
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024. 7, 27
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Judging llm-as-a-judge with mt-bench and chatbot arena,
L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xinget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” inNeurIPS, 2023. 7, 27
2023
-
[32]
VLMEvalKit: An open-source toolkit for evaluating large multi-modality models,
H. Duan, J. Yang, Y . Qiao, X. Fang, L. Chen, Y . Liu, X. Dong, Y . Zang, P. Zhang, J. Wanget al., “VLMEvalKit: An open-source toolkit for evaluating large multi-modality models,” inMM, 2024. 7, 27
2024
-
[33]
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
K. Zhang, B. Li, P. Zhang, F. Pu, J. A. Cahyono, K. Hu, S. Liu, Y . Zhang, J. Yang, C. Liet al., “Lmms-eval: Reality check on the evaluation of large multimodal models,”arXiv preprint arXiv:2407.12772, 2024. 7, 27
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
VizWiz grand challenge: Answering visual questions from blind people,
D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, “VizWiz grand challenge: Answering visual questions from blind people,” inCVPR, 2018. 7, 27
2018
-
[35]
GQA: A new dataset for real-world visual reasoning and compositional question answering,
D. A. Hudson and C. D. Manning, “GQA: A new dataset for real-world visual reasoning and compositional question answering,” inCVPR, 2019. 7, 27
2019
-
[36]
MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,” inICLR, 2024. 7, 27
2024
-
[37]
Measuring multimodal mathematical reasoning with math-vision dataset,
K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, and H. Li, “Measuring multimodal mathematical reasoning with math-vision dataset,” inNeurIPS, 2024. 7, 27
2024
-
[38]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “ChartQA: A benchmark for question answering about charts with visual and logical reasoning,”arXiv preprint arXiv:2203.10244, 2022. 7, 27
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[39]
Towards vqa models that can read,
A. Singh, V . Natarajan, M. Shah, Y . Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” inCVPR, 2019. 7, 27
2019
-
[40]
OCRVQA: Visual question answering by reading text in images,
A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty, “OCRVQA: Visual question answering by reading text in images,” inICDAR, 2019. 7, 27
2019
-
[41]
Referitgame: Referring to objects in photographs of natural scenes,
S. Kazemzadeh, V . Ordonez, M. Matten, and T. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” inEMNLP, 2014. 7, 27
2014
-
[42]
MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,
X. Yue, Y . Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y . Sunet al., “MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in CVPR, 2024. 7, 27
2024
-
[43]
Docvqa: A dataset for vqa on document images,
M. Mathew, D. Karatzas, and C. Jawahar, “Docvqa: A dataset for vqa on document images,” inWACV, 2021. 7, 27
2021
-
[44]
Learn to explain: Multimodal reasoning via thought chains for science question answering,
P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” inNeurIPS, 2022. 7, 27
2022
-
[45]
A diagram is worth a dozen images,
A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi, “A diagram is worth a dozen images,” inECCV, 2016. 7, 27
2016
-
[46]
InfographicVQA,
M. Mathew, V . Bagal, R. Tito, D. Karatzas, E. Valveny, and C. V . Jawahar, “InfographicVQA,” inWACV, 2022. 7, 27
2022
-
[47]
Learning to answer questions in dynamic audio-visual scenarios,
G. Li, Y . Wei, Y . Tian, C. Xu, J.-R. Wen, and D. Hu, “Learning to answer questions in dynamic audio-visual scenarios,” inCVPR, 2022. 7, 27
2022
-
[48]
A VQA: A dataset for audio-visual question answering on videos,
P. Yang, X. Wang, X. Duan, H. Chen, R. Hou, C. Jin, and W. Zhu, “A VQA: A dataset for audio-visual question answering on videos,” in MM, 2022. 7, 27
2022
-
[49]
Fusionbench: A comprehensive benchmark of deep model fusion,
A. Tang, L. Shen, Y . Luo, H. Hu, B. Du, and D. Tao, “Fusionbench: A comprehensive benchmark of deep model fusion,”arXiv preprint arXiv:2406.03280, 2024. 7, 22
-
[50]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inICML, 2021. 8, 27
2021
-
[51]
GLUE: A multi-task benchmark and analysis platform for natural language understanding,
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” inICLR, 2019. 8
2019
-
[52]
Scaling instruction-finetuned language models,
H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahmaet al., “Scaling instruction-finetuned language models,”JMLR, 2024. 8, 10, 21
2024
-
[53]
Mergebench: A benchmark for merging domain-specialized llms,
Y . He, S. Zeng, Y . Hu, R. Yang, T. Zhang, and H. Zhao, “Mergebench: A benchmark for merging domain-specialized llms,” inNeurIPS, 2025. 8, 10
2025
-
[54]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The Llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakanoet al., “Training verifiers to solve math word problems,”arXiv preprint arXiv:2110.14168, 2021. 8
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[56]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374,
work page internal anchor Pith review Pith/arXiv arXiv
-
[57]
Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation,
J. Liu, C. S. Xia, Y . Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation,” inNeurIPS, 2023. 8
2023
-
[58]
Instruction-Following Evaluation for Large Language Models
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023. 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
TruthfulQA: Measuring how models mimic human falsehoods,
S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” inACL, 2022. 8
2022
-
[60]
Measuring massive multitask language understanding,
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” inICLR, 2021. 8
2021
-
[61]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try ARC, the AI2 reasoning challenge,”arXiv preprint arXiv:1803.05457, 2018. 8
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[62]
HellaSwag: Can a machine really finish your sentence?
R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi, “HellaSwag: Can a machine really finish your sentence?” inACL, 2019. 8
2019
-
[63]
H. W. Engl, M. Hanke, and A. Neubauer,Regularization of Inverse Problems, ser. Mathematics and Its Applications. Dordrecht, The Netherlands: Kluwer Academic Publishers, 1996. 17 13
1996
-
[64]
Task arithmetic in the tangent space: Improved editing of pre-trained models,
G. Ortiz-Jimenez, A. Favero, and P. Frossard, “Task arithmetic in the tangent space: Improved editing of pre-trained models,” inNeurIPS,
-
[65]
SGD: General analysis and improved rates,
R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richt´arik, “SGD: General analysis and improved rates,” inICML,
-
[66]
Better theory for SGD in the nonconvex world,
A. Khaled and P. Richt ´arik, “Better theory for SGD in the nonconvex world,”TMLR, 2023. 20
2023
-
[67]
MAP: Low-compute model merging with amortized pareto fronts via quadratic approximation,
L. Li, T. Zhang, Z. Bu, S. Wang, H. He, J. Fu, Y . Wu, J. Bian, Y . Chen, and Y . Bengio, “MAP: Low-compute model merging with amortized pareto fronts via quadratic approximation,” inICLR, 2025. 21
2025
-
[68]
What happens during finetuning of vision transformers: An invariance based investigation,
G. Merlin, V . Nanda, R. Rawal, and M. Toneva, “What happens during finetuning of vision transformers: An invariance based investigation,” inCoLLAs, 2023. 21
2023
-
[69]
π-tuning: Transferring multimodal foundation models with optimal multi-task interpolation,
C. Wu, T. Wang, Y . Ge, Z. Lu, R. Zhou, Y . Shan, and P. Luo, “π-tuning: Transferring multimodal foundation models with optimal multi-task interpolation,” inICML, 2023. 21, 24
2023
-
[70]
MMBench: Is your multi-modal model an all- around player?
Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liuet al., “MMBench: Is your multi-modal model an all- around player?” inECCV, 2024. 26
2024
-
[71]
Seed- bench: Benchmarking multimodal large language models,
B. Li, Y . Ge, Y . Ge, G. Wang, R. Wang, R. Zhang, and Y . Shan, “Seed- bench: Benchmarking multimodal large language models,” inCVPR,
-
[72]
MME: A comprehensive evaluation benchmark for multimodal large language models,
C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sunet al., “MME: A comprehensive evaluation benchmark for multimodal large language models,” inNeurIPS, 2025. 26
2025
-
[73]
Are we on the right way for evaluating large vision-language models?
L. Chen, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, J. Wang, Y . Qiao, D. Lin, and F. Zhao, “Are we on the right way for evaluating large vision-language models?” inNeurIPS, 2024. 26
2024
-
[74]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering,
Y . Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” inCVPR, 2017. 27
2017
-
[75]
OK-VQA: A visual question answering benchmark requiring external knowledge,
K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “OK-VQA: A visual question answering benchmark requiring external knowledge,” in CVPR, 2019. 27
2019
-
[76]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inCVPR, 2024. 27
2024
-
[77]
CogVLM: Visual expert for pretrained language models,
W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y . Wang, J. Ji, Z. Yang, L. Zhao, S. XiXuanet al., “CogVLM: Visual expert for pretrained language models,” inNeurIPS, 2024. 27
2024
-
[78]
An augmented benchmark dataset for geometric question answering through dual parallel text encoding,
J. Cao and J. Xiao, “An augmented benchmark dataset for geometric question answering through dual parallel text encoding,” inCOLING,
-
[79]
arXiv preprint arXiv:2312.11370 (2023)
J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y . Wang, L. Hong, J. Han, H. Xu, Z. Liet al., “G-LLaV A: Solving geometric problem with multi- modal large language model,”arXiv preprint arXiv:2312.11370, 2023. 27
-
[80]
DVQA: Understanding data visualizations via question answering,
K. Kafle, B. Price, S. Cohen, and C. Kanan, “DVQA: Understanding data visualizations via question answering,” inCVPR, 2018. 27
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.