
arxiv: 2602.01167 · v1 · pith:R3K7NMIYnew · submitted 2026-02-01 · 💻 cs.AI

Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

Pith reviewed 2026-05-21 14:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords vision-language models · task-interfering layers · layer intervention · test-time adaptation · TaLo · multimodal tasks · model modularity · ScienceQA

The pith

Certain layers in vision-language models interfere with specific tasks, and bypassing them improves performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pretrained vision-language models engage all layers by default when performing downstream tasks. Intervening on single layers by zeroing their parameters can lead to better results on some tasks, revealing that not all layers contribute positively. The study identifies task-interfering layers through systematic intervention and introduces the Task-Layer Interaction Vector to measure each layer's effect on a task. Tasks with similar requirements show similar patterns in how layers affect them. To leverage this, the authors create TaLo, a method that automatically selects and bypasses the most interfering layer for any given task at test time.

Core claim

In pretrained VLMs, some layers act as task-interfering layers that reduce performance on downstream tasks. By measuring the change in performance after intervening on each layer, the authors find that bypassing certain layers consistently improves results. These interfering layers display task-specific patterns: similar tasks have highly similar task-layer interaction vectors. TaLo exploits these patterns to knock out the most interfering layer at test time without any training, improving Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%.
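
The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-layer accuracy changes have already been measured on a small probe set for the task, and the function name `select_knockout_layer` and the `min_gain` threshold are hypothetical.

```python
from typing import Optional, Sequence


def select_knockout_layer(deltas: Sequence[float], min_gain: float = 0.0) -> Optional[int]:
    """Return the index of the layer whose knockout helps most, or None.

    deltas[i] is Acc(model with layer i knocked out) - Acc(base model),
    measured on a small probe set for the task. `min_gain` is a
    hypothetical threshold, not something the paper specifies.
    """
    if not deltas:
        return None
    # The most interfering layer is the one with the largest accuracy gain.
    best = max(range(len(deltas)), key=lambda i: deltas[i])
    # If no knockout beats the base model, keep every layer.
    return best if deltas[best] > min_gain else None


# Illustrative numbers only (not results from the paper).
probe_deltas = [-0.12, 0.01, -0.03, 0.05, -0.40]
print(select_knockout_layer(probe_deltas))  # 3
```

Returning `None` when no layer helps is a design choice of this sketch: it falls back to the unmodified model rather than forcing a knockout.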

What carries the argument

The Task-Layer Interaction Vector, which records, for each layer, the change in task performance relative to the base model when that layer is intervened on.
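
As a rough sketch of how such a vector could be computed, the snippet below takes predictions from the base model and from one intervened model per layer, and returns the per-layer accuracy difference. The helper names `accuracy` and `interaction_vector` and the toy answers are assumptions for illustration; the paper's evaluation pipeline is not reproduced here.

```python
from typing import List, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the labels exactly."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def interaction_vector(
    base_preds: Sequence[str],
    intervened_preds: Sequence[Sequence[str]],
    labels: Sequence[str],
) -> List[float]:
    """Entry i is Acc(model intervened at layer i) - Acc(base model)."""
    base_acc = accuracy(base_preds, labels)
    return [accuracy(preds, labels) - base_acc for preds in intervened_preds]


# Toy multiple-choice answers (hypothetical, for illustration only).
labels = ["A", "B", "C", "D"]
base = ["A", "B", "A", "A"]        # base accuracy 0.5
per_layer = [
    ["A", "B", "C", "A"],          # layer 0 intervened: accuracy 0.75
    ["B", "B", "A", "A"],          # layer 1 intervened: accuracy 0.25
]
print(interaction_vector(base, per_layer, labels))  # [0.25, -0.25]
```

A positive entry marks a layer whose intervention helps the task (a candidate interfering layer); a negative entry marks a layer the task depends on.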

Load-bearing premise

Zeroing out a layer's parameters accurately isolates its interfering effect without causing other unintended changes in how the model computes outputs.

What would settle it

An experiment where randomly bypassing layers yields similar or better improvements than targeting the identified interfering ones, or where zeroing layers fails to improve performance when confounding factors are controlled.
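
One way to frame the random-bypass control is sketched below. It assumes two disjoint splits of a task: a probe split used to pick the layer and a held-out split used to score it. The function name `targeted_vs_random` and all numbers are hypothetical; the sketch only shows the comparison, not the paper's protocol.

```python
from statistics import mean
from typing import Sequence, Tuple


def targeted_vs_random(
    probe_deltas: Sequence[float],
    heldout_deltas: Sequence[float],
) -> Tuple[float, float]:
    """Compare a targeted knockout with a uniformly random one.

    probe_deltas[i] and heldout_deltas[i] are the accuracy changes from
    knocking out layer i, measured on two disjoint splits of the task.
    Returns (held-out gain of the layer chosen on the probe split,
    expected held-out gain of a layer chosen uniformly at random).
    """
    if len(probe_deltas) != len(heldout_deltas) or not probe_deltas:
        raise ValueError("both inputs must be non-empty and equal in length")
    chosen = max(range(len(probe_deltas)), key=lambda i: probe_deltas[i])
    return heldout_deltas[chosen], mean(heldout_deltas)


# Illustrative numbers only (not results from the paper).
targeted, random_expected = targeted_vs_random(
    probe_deltas=[0.02, -0.10, 0.06, -0.30],
    heldout_deltas=[0.01, -0.08, 0.04, -0.25],
)
print(targeted > random_expected)  # True
```

If the targeted gain were not clearly above the random baseline on held-out data, that would count against the claim that the identified layers are specifically interfering.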

Figures

Figures reproduced from arXiv: 2602.01167 by Lei Feng, Shuo Yang, Weili Guan, Xiaobo Xia, Xiu Su, Yujie Wei, Zeke Xie, Zhiming Liu.

Figure 1: Overview of the task-interfering layer phenomenon. Each axis corresponds to a task category: AR (Attribute Reasoning), RR (Relation Reasoning), LR (Logical Reasoning), CP (Coarse Perception), FP-S (Fine-grained Perception [single-instance]), and FP-C (Fine-grained Perception [cross-instance]). Each plot shows model performance after zeroing out a single layer (solid curves), with the orange dashed line ind…
Figure 2: Empirical Validation of the Task-Interfering Layers. (a) Visualization of the percentage change in accuracy across tasks after zeroing each layer on LLaVA-Next-LLaMA3-8B. Red indicates performance improvements relative to the base model, while blue indicates degradation. Many tasks show performance gains under layer interventions, indicating that interfering layers commonly exist in VLMs. (b) The t-SNE…
Figure 3: Framework of TaLo. TaLo first dynamically selects the Task-Interfering layer for a specific task and knocks out that layer in the final evaluation procedure.

where each element $v_i^{(T)}$, referred to as the layer sensitivity score, quantifies the change in task performance upon intervention at layer $i$. Formally, it is defined as

$$v_i^{(T)} = \mathrm{Acc}\big(M_{\mathrm{intv}}^{(i)}, T\big) - \mathrm{Acc}\big(M_{\mathrm{base}}, T\big). \tag{2}$$

Here, $\mathrm{Acc}(\cdot, T)$ denot…
Figure 4: Consistency analysis of different interventions.
Figure 5: Qualitative case study on random noise intervention.
Figure 7: Layer Selection’s Robustness Analysis. [Plot: "Robustness of Layer Selection Distribution"; x-axis: Layer Index (1 to 32); y-axis: Selection Frequency (0 to 30); series: 10-shot, 15-shot, 20-shot.] (a) Analysis on Math task …
Figure 8: Qualitative Case Studies Illustrating the Effects of Layer Zeroing on LLaVA-Next’s Reasoning. The figure presents three …
Figure 9: Accuracy change heatmaps on MMBench (Uniform Scaling).
Figure 10: Accuracy change heatmap for LLaVA-Next on MathVista-MINI (Uniform Scaling).
Figure 11: Accuracy change heatmap for LLaVA-Next on SEEDBench (Uniform Scaling).
Figure 12: Accuracy change heatmap for LLaVA-Next on ScienceQA (Uniform Scaling).
Figure 13: Accuracy change heatmaps on MMStar (Uniform Scaling).
Figure 14: Accuracy change heatmaps on MMMU (Uniform Scaling).
Figure 15: Accuracy change heatmaps on MMBench (Zeroing).
Figure 16: Accuracy change heatmap for LLaVA-Next on MathVista-MINI (Zeroing).
Figure 17: Accuracy change heatmap for LLaVA-Next on SEEDBench (Zeroing).
Figure 18: Accuracy change heatmap for LLaVA-Next on ScienceQA (Zeroing).
Figure 19: Accuracy change heatmaps on MMStar (Zeroing).
Figure 20: Accuracy change heatmaps on MMMU (Zeroing).
Original abstract

Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement can be generalizable across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks' performance. We introduce Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
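
The abstract's claim that related tasks have similar task-layer interaction vectors can be checked with any vector similarity. The sketch below uses cosine similarity, which is an assumption of this example (the metric is not named here), and the three task vectors are made-up illustrations.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical per-layer accuracy changes for three tasks (not from the paper).
logic_task = [0.03, -0.10, 0.05, -0.20]
math_task = [0.02, -0.12, 0.04, -0.18]
perception_task = [-0.15, 0.04, -0.09, 0.02]

print(cosine_similarity(logic_task, math_task) > cosine_similarity(logic_task, perception_task))  # True
```

Under this reading, two tasks "requiring similar capabilities" would show a similarity close to 1, while tasks that rely on different layers would score lower or negative.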

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that pretrained vision-language models contain task-interfering layers whose removal via intervention (e.g., zeroing parameters) can improve downstream task performance. It introduces the Task-Layer Interaction Vector to quantify per-layer effects, observes consistent patterns across similar tasks, and proposes the training-free TaLo method that dynamically bypasses the most interfering layer at test time, reporting gains such as 16.6% on the ScienceQA Maps task for Qwen-VL.

Significance. If the layer-intervention results prove robust to alternative bypass mechanisms, the work would usefully demonstrate unexpected modularity in VLMs and supply a simple plug-and-play inference-time adaptation technique. The direct empirical measurements and cross-model/dataset observations are strengths; the introduction of the interaction vector provides a concrete, falsifiable way to characterize layer-task relationships.

major comments (2)
  1. [Layer intervention experiments (Section 3)] The identification of task-interfering layers rests on zeroing layer parameters as the primary intervention. This proxy can rescale downstream activations, interact with layer norms, and alter residual dynamics in ways that differ from a true layer bypass (e.g., a skip connection or attention masking). Without a side-by-side comparison of zeroing versus explicit bypass on the same layers and tasks, the performance gains (including the 16.6% figure) cannot be confidently attributed to removal of interference rather than intervention side-effects.
  2. [Results and TaLo evaluation (Section 4)] The generalizability claim across models and datasets requires fuller reporting of all layer-intervention outcomes, including negative or neutral results, together with statistical controls for multiple testing. Selective highlighting of improvements risks overstating the prevalence and reliability of task-interfering layers.
minor comments (2)
  1. [Method (Section 3.1)] The definition and exact computation of the Task-Layer Interaction Vector should be stated with an equation or pseudocode to allow replication.
  2. [Figures 4-5] Figure captions and axis labels for the similarity matrices of task-layer vectors need clearer annotation to make the claimed high similarity between related tasks immediately visible.
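
Major comment 1 turns on whether zeroing a layer's parameters behaves like skipping it. The toy sketch below suggests the answer depends on where normalization sits. It assumes that "zeroing" sets every parameter of the block to zero (weights, biases, and norm gains) and uses a single linear sublayer in place of attention and MLP; both are simplifications made for this example, not details taken from the paper.

```python
import math
from typing import Dict, List

Vector = List[float]


def layer_norm(x: Vector, gain: Vector, beta: Vector, eps: float = 1e-5) -> Vector:
    """Standard layer normalization with a learned gain and bias."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(x, gain, beta)]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Dense layer: weight @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def pre_norm_block(x: Vector, p: Dict) -> Vector:
    """y = x + f(LN(x)): the residual path bypasses the normalization."""
    h = linear(layer_norm(x, p["gain"], p["beta"]), p["weight"], p["bias"])
    return [a + b for a, b in zip(x, h)]


def post_norm_block(x: Vector, p: Dict) -> Vector:
    """y = LN(x + f(x)): the residual path goes through the normalization."""
    h = linear(x, p["weight"], p["bias"])
    return layer_norm([a + b for a, b in zip(x, h)], p["gain"], p["beta"])


def zeroed_params(dim: int) -> Dict:
    """All parameters of the block set to zero (the assumed meaning of 'zeroing')."""
    return {
        "weight": [[0.0] * dim for _ in range(dim)],
        "bias": [0.0] * dim,
        "gain": [0.0] * dim,
        "beta": [0.0] * dim,
    }


x = [1.0, -2.0, 0.5]
print(pre_norm_block(x, zeroed_params(3)))   # [1.0, -2.0, 0.5]
print(post_norm_block(x, zeroed_params(3)))  # [0.0, 0.0, 0.0]
```

In the pre-norm toy block, zeroing reduces to the identity, so it coincides with an explicit skip. In the post-norm toy block, zeroing also erases the residual stream, which is a different intervention. If only a subset of parameters is zeroed, the behavior can differ again, which is why the side-by-side comparison requested by the referee is informative.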

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and valuable suggestions. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Layer intervention experiments (Section 3)] The identification of task-interfering layers rests on zeroing layer parameters as the primary intervention. This proxy can rescale downstream activations, interact with layer norms, and alter residual dynamics in ways that differ from a true layer bypass (e.g., a skip connection or attention masking). Without a side-by-side comparison of zeroing versus explicit bypass on the same layers and tasks, the performance gains (including the 16.6% figure) cannot be confidently attributed to removal of interference rather than intervention side-effects.

    Authors: We appreciate this important distinction between zeroing and a pure architectural bypass. Zeroing was chosen as a direct ablation to nullify a layer's contribution, following standard practices in neural network interpretability. We acknowledge that side effects on norms and residuals could contribute to observed gains. To isolate the effect, we will add side-by-side experiments in the revised manuscript comparing zeroing against explicit skip connections and attention masking on the same layers and tasks, including the ScienceQA Maps example.
    Revision: yes

  2. Referee: [Results and TaLo evaluation (Section 4)] The generalizability claim across models and datasets requires fuller reporting of all layer-intervention outcomes, including negative or neutral results, together with statistical controls for multiple testing. Selective highlighting of improvements risks overstating the prevalence and reliability of task-interfering layers.

    Authors: We agree that complete reporting strengthens the claims. The manuscript already covers multiple models and datasets with some neutral outcomes noted, but we will expand Section 4 and the appendix to include a full table of all layer-intervention results (positive, neutral, and negative) across experiments. We will also add statistical controls such as corrected p-values or confidence intervals to account for multiple testing.
    Revision: yes
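
As one example of the "corrected p-values" mentioned in this exchange, the sketch below applies the Holm-Bonferroni step-down adjustment to a set of per-layer p-values. The choice of Holm's method is an assumption of this example (the rebuttal does not name a procedure), and the p-values are invented for illustration.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values, one per layer intervention (not from the paper).
raw = [0.01, 0.04, 0.03, 0.5]
print([round(p, 4) for p in holm_adjust(raw)])  # [0.04, 0.09, 0.09, 0.5]
```

With one test per layer and per task, the number of comparisons grows quickly, so an improvement that looks significant in isolation may not survive the adjustment.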

Circularity Check

0 steps flagged

No significant circularity: empirical interventions and direct measurements

Full rationale

The paper's core claims rest on direct empirical measurements: zeroing individual layer parameters, recording performance deltas on downstream tasks, and defining the Task-Layer Interaction Vector from those observed changes. TaLo is then a simple selection rule that picks, for a given task, the layer whose knockout yields the largest accuracy gain (the largest positive delta relative to the base model). No equations reduce a claimed prediction to a fitted input by construction, no self-citations bear the central premise, and no uniqueness theorems or ansatzes are imported from prior author work. The results are presented as falsifiable experimental outcomes across multiple models and datasets rather than derived quantities, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper's claims rest on the empirical validity of layer interventions as indicators of interference, with no free parameters explicitly fitted but potential implicit choices in layer selection and intervention type.

axioms (1)
  • Domain assumption: Intervening on a layer by zeroing its parameters reveals whether that layer helps or interferes with a given task.
    This premise underpins the identification of task-interfering layers and the design of TaLo.
invented entities (1)
  • Task-Layer Interaction Vector (no independent evidence)
    Purpose: Quantifies the effect of layer interventions on task performance.
    Defined based on performance changes from interventions.

pith-pipeline@v0.9.0 · 5845 in / 1225 out tokens · 66137 ms · 2026-05-21T14:43:19.148062+00:00 · methodology

