Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

Lei Feng; Shuo Yang; Weili Guan; Xiaobo Xia; Xiu Su; Yujie Wei; Zeke Xie; Zhiming Liu

arxiv: 2602.01167 · v1 · pith:R3K7NMIYnew · submitted 2026-02-01 · 💻 cs.AI

Zhiming Liu, Yujie Wei, Lei Feng, Xiu Su, Xiaobo Xia, Weili Guan, Zeke Xie, Shuo Yang

Pith reviewed 2026-05-21 14:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords vision-language models · task-interfering layers · layer intervention · test-time adaptation · TaLo · multimodal tasks · model modularity · ScienceQA

The pith

Certain layers in vision-language models interfere with specific tasks, and bypassing them improves performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pretrained vision-language models engage all layers by default when performing downstream tasks. Intervening on single layers by zeroing their parameters can lead to better results on some tasks, revealing that not all layers contribute positively. The study identifies task-interfering layers through systematic intervention and introduces the Task-Layer Interaction Vector to measure each layer's effect on a task. Tasks with similar requirements show similar patterns in how layers affect them. To leverage this, the authors create TaLo, a method that automatically selects and bypasses the most interfering layer for any given task at test time.

Core claim

In pretrained VLMs, some layers act as task-interfering layers that reduce performance on downstream tasks. By measuring the change in performance after intervening on each layer, the authors find consistent improvements when certain layers are bypassed. These interfering layers display task-specific patterns, with similar tasks showing high similarity in their task-layer interaction vectors. TaLo uses these patterns to knock out the most interfering layer at test time without any training, reporting gains of up to 16.6% on the Maps task in ScienceQA with Qwen-VL.
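
As a rough illustration of the selection step described above, the sketch below picks a layer from a small labelled probe set. It is not the authors' implementation: the `predict` callable, the probe-set format, and the fall-back to the base model when no knockout helps are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: predict(example_input, knocked_out_layer) -> answer.
# knocked_out_layer=None stands for the unmodified base model.
Predict = Callable[[str, Optional[int]], str]


def accuracy(predict: Predict, probe: Sequence[Tuple[str, str]],
             layer: Optional[int]) -> float:
    """Fraction of probe examples answered correctly with `layer` knocked out."""
    correct = sum(predict(x, layer) == y for x, y in probe)
    return correct / len(probe)


def select_layer(predict: Predict, probe: Sequence[Tuple[str, str]],
                 num_layers: int) -> Optional[int]:
    """Return the layer whose knockout gives the largest probe-set gain.

    Returns None (keep the base model) when no knockout beats the base
    accuracy; this fall-back is an assumption of the sketch.
    """
    base = accuracy(predict, probe, None)
    best_layer, best_gain = None, 0.0
    for layer in range(num_layers):
        gain = accuracy(predict, probe, layer) - base
        if gain > best_gain:
            best_layer, best_gain = layer, gain
    return best_layer
```

The cost of this loop is one probe-set evaluation per layer plus one for the base model, which is why a small probe set matters in practice.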

What carries the argument

Task-Layer Interaction Vector, which quantifies the impact of intervening on each layer for a particular task by tracking performance changes.
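
A minimal sketch of how such a vector could be computed from measured accuracies, following the definition quoted in the Figure 3 caption (accuracy after intervening on layer i minus base accuracy). The function names and the toy numbers are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def interaction_vector(base_acc: float,
                       intervened_accs: Sequence[float]) -> List[float]:
    """v_i = Acc(model with layer i intervened, T) - Acc(base model, T)."""
    return [acc - base_acc for acc in intervened_accs]


def interfering_layers(vector: Sequence[float]) -> List[int]:
    """Indices of layers whose intervention improved accuracy (v_i > 0)."""
    return [i for i, v in enumerate(vector) if v > 0]


# Made-up accuracies for a 4-layer toy model on a single task.
v = interaction_vector(0.5, [0.25, 0.5, 0.75, 0.375])
print(v)                      # [-0.25, 0.0, 0.25, -0.125]
print(interfering_layers(v))  # [2]
```

Under this sign convention a positive entry marks a layer that interferes with the task, and a negative entry marks a layer the task depends on.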

Load-bearing premise

Zeroing out a layer's parameters accurately isolates its interfering effect without causing other unintended changes in how the model computes outputs.

What would settle it

An experiment where randomly bypassing layers yields similar or better improvements than targeting the identified interfering ones, or where zeroing layers fails to improve performance when confounding factors are controlled.
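
One way such a control could be scored is sketched below, assuming per-layer accuracy gains have been measured on a held-out split that was not used to choose the layer. The function name and the example gains are hypothetical; the expected gain of a uniformly random bypass is simply the mean of the held-out gains.

```python
from typing import Sequence, Tuple


def targeted_vs_random(heldout_gains: Sequence[float],
                       selected: int) -> Tuple[float, float, float]:
    """Compare the selected layer with a uniformly random bypass.

    Returns (gain of the selected layer,
             expected gain of a random layer,
             fraction of layers that match or beat the selected layer).
    """
    n = len(heldout_gains)
    targeted = heldout_gains[selected]
    expected_random = sum(heldout_gains) / n
    at_least_as_good = sum(g >= targeted for g in heldout_gains) / n
    return targeted, expected_random, at_least_as_good


# Hypothetical held-out gains for a 4-layer toy model; layer 2 was selected.
print(targeted_vs_random([-0.25, 0.0, 0.25, -0.5], selected=2))
```

If the targeted gain is no larger than the random expectation, or many layers match it, the selection rule adds little beyond bypassing any layer.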

Figures

Figures reproduced from arXiv: 2602.01167 by Lei Feng, Shuo Yang, Weili Guan, Xiaobo Xia, Xiu Su, Yujie Wei, Zeke Xie, Zhiming Liu.

**Figure 1.** Overview of the task-interfering layer phenomenon. Each axis corresponds to a task category: AR (Attribute Reasoning), RR (Relation Reasoning), LR (Logical Reasoning), CP (Coarse Perception), FP-S (Fine-grained Perception [single-instance]), and FP-C (Fine-grained Perception [cross-instance]). Each plot shows model performance after zeroing out a single layer (solid curves), with the orange dashed line ind…

**Figure 2.** Empirical Validation of the Task-Interfering Layers. (a) Visualization of the percentage change in accuracy across tasks after zeroing each layer on LLaVA-Next-LLaMA3-8B. Red indicates performance improvements relative to the base model, while blue indicates degradation. Many tasks show performance gains under layer interventions, indicating that interfering layers commonly exist in VLMs. (b) The t-SNE…

**Figure 3.** Framework of TaLo. TaLo first dynamically selects the task-interfering layer for a specific task and knocks out that layer in the final evaluation procedure. … where each element $v_i^{(T)}$, referred to as the layer sensitivity score, quantifies the change in task performance upon intervention at layer $i$. Formally, it is defined as $v_i^{(T)} = \mathrm{Acc}\big(M_{\mathrm{intv}}^{(i)}, T\big) - \mathrm{Acc}\big(M_{\mathrm{base}}, T\big)$ (Eq. 2). Here, $\mathrm{Acc}(\cdot, T)$ denot…

**Figure 4.** Consistency analysis of different interventions.

**Figure 5.** Qualitative case study on random noise intervention.

**Figure 7.** Layer Selection’s Robustness Analysis. Panel title: Robustness of Layer Selection Distribution; x-axis: Layer Index (1 to 32); y-axis: Selection Frequency; series: 10-shot, 15-shot, 20-shot. (a) Analysis on Math task…

**Figure 8.** Qualitative Case Studies Illustrating the Effects of Layer Zeroing on LLaVA-Next’s Reasoning. The figure presents three…

**Figure 9.** Accuracy change heatmaps on MMBench (Uniform Scaling).

**Figure 10.** Accuracy change heatmap for LLaVA-Next on MathVista-MINI (Uniform Scaling).

**Figure 11.** Accuracy change heatmap for LLaVA-Next on SEEDBench (Uniform Scaling).

**Figure 12.** Accuracy change heatmap for LLaVA-Next on ScienceQA (Uniform Scaling).

**Figure 13.** Accuracy change heatmaps on MMStar (Uniform Scaling).

**Figure 14.** Accuracy change heatmaps on MMMU (Uniform Scaling).

**Figure 15.** Accuracy change heatmaps on MMBench (Zeroing).

**Figure 16.** Accuracy change heatmap for LLaVA-Next on MathVista-MINI (Zeroing).

**Figure 17.** Accuracy change heatmap for LLaVA-Next on SEEDBench (Zeroing).

**Figure 18.** Accuracy change heatmap for LLaVA-Next on ScienceQA (Zeroing).

**Figure 19.** Accuracy change heatmaps on MMStar (Zeroing).

**Figure 20.** Accuracy change heatmaps on MMMU (Zeroing).

read the original abstract

Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement can be generalizable across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks' performance. We introduce Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
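
The abstract does not say which similarity measure is used to compare task-layer interaction vectors. As an assumption for illustration only, the sketch below uses cosine similarity, a common choice for comparing the direction of two response profiles regardless of their magnitude.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of layers")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical interaction vectors for two tasks on a 3-layer toy model.
task_a = [0.1, -0.2, 0.3]
task_b = [0.2, -0.4, 0.6]   # same pattern as task_a, twice the magnitude
print(cosine_similarity(task_a, task_b))
```

Two tasks whose layers respond in the same proportions score close to 1, even if one task is far more sensitive overall.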

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Some layers in VLMs hurt specific tasks, and skipping the worst one at test time gives measurable gains, but zeroing weights may not cleanly prove interference.

The main point is that intervening on single layers in pretrained VLMs can raise accuracy on downstream tasks, and the authors turn that observation into a training-free adaptation rule called TaLo that picks which layer to bypass per task. They report gains as large as 16.6% on the Maps subset of ScienceQA for Qwen-VL, and they show that tasks with similar demands produce similar patterns of layer sensitivity. That is the concrete new piece: a systematic measurement of per-layer performance deltas across multiple models and datasets, plus the Task-Layer Interaction Vector as a compact way to summarize those effects.

The work is useful because it stays empirical and directly measures what happens when you zero a layer rather than relying on gradient-based attributions or post-hoc explanations. The patterns they find across tasks also line up with intuition about shared capabilities, which adds some face validity.

The soft spot is the intervention method itself. Zeroing parameters changes activation magnitudes and residual flow in ways a true skip connection would not, so the performance lift could come from those side effects rather than from removing an actively harmful layer. The abstract gives little detail on controls for that confound or on how they correct for multiple tests across layers and tasks, which leaves the generalizability claim thinner than it needs to be.

Readers working on test-time adaptation or internal model analysis will get the most from this. It is worth sending to referees because the core observation is straightforward to check and the proposed method is simple enough to reproduce quickly, even if the causal story needs more scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that pretrained vision-language models contain task-interfering layers whose removal via intervention (e.g., zeroing parameters) can improve downstream task performance. It introduces the Task-Layer Interaction Vector to quantify per-layer effects, observes consistent patterns across similar tasks, and proposes the training-free TaLo method that dynamically bypasses the most interfering layer at test time, reporting gains such as 16.6% on the ScienceQA Maps task for Qwen-VL.

Significance. If the layer-intervention results prove robust to alternative bypass mechanisms, the work would usefully demonstrate unexpected modularity in VLMs and supply a simple plug-and-play inference-time adaptation technique. The direct empirical measurements and cross-model/dataset observations are strengths; the introduction of the interaction vector provides a concrete, falsifiable way to characterize layer-task relationships.

major comments (2)

[Layer intervention experiments (Section 3)] The identification of task-interfering layers rests on zeroing layer parameters as the primary intervention. This proxy can rescale downstream activations, interact with layer norms, and alter residual dynamics in ways that differ from a true layer bypass (e.g., a skip connection or attention masking). Without a side-by-side comparison of zeroing versus explicit bypass on the same layers and tasks, the performance gains (including the 16.6% figure) cannot be confidently attributed to removal of interference rather than intervention side-effects.
[Results and TaLo evaluation (Section 4)] The generalizability claim across models and datasets requires fuller reporting of all layer-intervention outcomes, including negative or neutral results, together with statistical controls for multiple testing. Selective highlighting of improvements risks overstating the prevalence and reliability of task-interfering layers.
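
The report does not name a specific correction for multiple testing. As one possible control, assuming each layer-task pair yields a p-value for "this intervention changes accuracy", the sketch below applies the Holm-Bonferroni step-down procedure; the function name and the example p-values are illustrative.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm-Bonferroni: which hypotheses are rejected at family-wise level alpha.

    P-values are visited from smallest to largest; the k-th smallest
    (k starting at 0) is compared with alpha / (m - k), and the procedure
    stops at the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Hypothetical p-values for four layer interventions on one task.
print(holm_reject([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

With dozens of layers and many tasks the thresholds shrink quickly, so a gain that looks convincing for a single layer in isolation may not survive the correction.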

minor comments (2)

[Method (Section 3.1)] The definition and exact computation of the Task-Layer Interaction Vector should be stated with an equation or pseudocode to allow replication.
[Figures 4-5] Figure captions and axis labels for the similarity matrices of task-layer vectors need clearer annotation to make the claimed high similarity between related tasks immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and valuable suggestions. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to the manuscript.

Referee: [Layer intervention experiments (Section 3)] The identification of task-interfering layers rests on zeroing layer parameters as the primary intervention. This proxy can rescale downstream activations, interact with layer norms, and alter residual dynamics in ways that differ from a true layer bypass (e.g., a skip connection or attention masking). Without a side-by-side comparison of zeroing versus explicit bypass on the same layers and tasks, the performance gains (including the 16.6% figure) cannot be confidently attributed to removal of interference rather than intervention side-effects.

Authors: We appreciate this important distinction between zeroing and a pure architectural bypass. Zeroing was chosen as a direct ablation to nullify a layer's contribution, following standard practices in neural network interpretability. We acknowledge that side effects on norms and residuals could contribute to observed gains. To isolate the effect, we will add side-by-side experiments in the revised manuscript comparing zeroing against explicit skip connections and attention masking on the same layers and tasks, including the ScienceQA Maps example. revision: yes
Referee: [Results and TaLo evaluation (Section 4)] The generalizability claim across models and datasets requires fuller reporting of all layer-intervention outcomes, including negative or neutral results, together with statistical controls for multiple testing. Selective highlighting of improvements risks overstating the prevalence and reliability of task-interfering layers.

Authors: We agree that complete reporting strengthens the claims. The manuscript already covers multiple models and datasets with some neutral outcomes noted, but we will expand Section 4 and the appendix to include a full table of all layer-intervention results (positive, neutral, and negative) across experiments. We will also add statistical controls such as corrected p-values or confidence intervals to account for multiple testing. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical interventions and direct measurements

The paper's core claims rest on direct empirical measurements: zeroing individual layer parameters, recording performance deltas on downstream tasks, and defining the Task-Layer Interaction Vector from those observed changes. TaLo is then a simple selection rule that picks, for a given task, the layer whose knockout yields the largest accuracy gain (the largest positive delta when the delta is defined as intervened accuracy minus base accuracy). No equations reduce a claimed prediction to a fitted input by construction, no self-citations bear the central premise, and no uniqueness theorems or ansatzes are imported from prior author work. The results are presented as falsifiable experimental outcomes across multiple models and datasets rather than derived quantities, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper's claims rest on the empirical validity of layer interventions as indicators of interference, with no free parameters explicitly fitted but potential implicit choices in layer selection and intervention type.

axioms (1)

domain assumption: Intervening on a layer by zeroing its parameters reveals whether that layer helps or interferes with a given task.
This premise underpins the identification of task-interfering layers and the design of TaLo.

invented entities (1)

Task-Layer Interaction Vector · no independent evidence
purpose: Quantifies the effect of layer interventions on task performance.
Defined based on performance changes from interventions.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

[1]

De- tecting and pruning prominent but detrimental neurons in large language models

Ameen Ali Ali, Shahar Katz, Lior Wolf, and Ivan Titov. De- tecting and pruning prominent but detrimental neurons in large language models. InProceedings of the Second Con- ference on Language Modeling, 2025. 3

work page 2025
[2]

Data- efficient learning via minimizing hyperspherical energy

Xiaofeng Cao, Weiyang Liu, and Ivor W Tsang. Data- efficient learning via minimizing hyperspherical energy. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 45(11):13422–13437, 2023. 6

work page 2023
[3]

Mentored learning: Improving general- ization and convergence of student learner.Journal of Ma- chine Learning Research, 25(325):1–45, 2024

Xiaofeng Cao, Yaming Guo, Heng Tao Shen, Ivor W Tsang, and James T Kwok. Mentored learning: Improving general- ization and convergence of student learner.Journal of Ma- chine Learning Research, 25(325):1–45, 2024. 6

work page 2024
[4]

Are we on the right way for evaluating large vision-language models? InNeurIPS 2024,

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? InNeurIPS 2024, . 6, 7, 2

work page 2024
[5]

Attribution analysis meets model editing: Advancing knowledge correction in vision language models with visedit

Qizhou Chen, Taolin Zhang, Chengyu Wang, Xiaofeng He, Dakan Wang, and Tingting Liu. Attribution analysis meets model editing: Advancing knowledge correction in vision language models with visedit. InAAAI-25, pages 2168– 2176, . 3

work page
[6]

Bring reason to vision: Understanding perception and reasoning through model merging, 2025

Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He. Bring reason to vision: Understanding perception and reasoning through model merging, 2025. 2, 3, 7

work page 2025
[7]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InCVPR, pages 24185–24198, 2024. 6, 2

work page 2024
[8]

A survey on deep neural network pruning: Taxonomy, compar- ison, analysis, and recommendations.IEEE Trans

Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network pruning: Taxonomy, compar- ison, analysis, and recommendations.IEEE Trans. Pattern Anal. Mach. Intell., 46(12):10558–10578, 2024. 3

work page 2024
[9]

Knowledge neurons in pretrained transform- ers

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transform- ers. InACL,2022, pages 8493–8502. 3

work page 2022
[10]

Editing factual knowledge in language models

Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Pro- cessing, 2021. 3

work page 2021
[11]

Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning, 2025. 1

work page 2025
[12]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InACM MM,2024, pages 11198–11201, 2024. 7

work page 2024
[13]

Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit- levels.arXiv preprint arXiv:2406.17415, 2024

Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Mahesh- wary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, and Mihai Surdeanu. Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit- levels.arXiv preprint arXiv:2406.17415, 2024. 3

work page arXiv 2024
[14]

Diverse data augmentation with diffusions for effective test-time prompt tuning

Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. InICCV 2023, pages 2704–2714. 3

work page 2023
[15]

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts. The unreasonable ineffectiveness of the deeper layers. InThe Thirteenth In- ternational Conference on Learning Representations, ICLR

work page
[16]

Vlm-auto: Vlm-based autonomous driving assistant with human-like behavior and understanding for complex road scenes

Ziang Guo, Zakhar Yagudin, Artem Lykov, Mikhail Ko- nenkov, and Dzmitry Tsetserukou. Vlm-auto: Vlm-based autonomous driving assistant with human-like behavior and understanding for complex road scenes. In2nd Interna- tional Conference on Foundation and Large Language Mod- els, FLLM 2024, pages 501–507. 1

work page 2024
[17]

Channel pruning for accelerating very deep neural networks

Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. InICCV 2017, pages 1398–1406. 3

work page 2017
[18]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Represen- tations, ICLR 2022. 7

work page 2022
[19]

Lan- guage is not all you need: Aligning perception with language models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Nils Johan Bertil Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Lan- guage is not all you need: Aligning perception with language models. InNeurIPS 2023. 1

work page 2023
[20]

Editing models with task arithmetic

Gabriel Ilharco, Marco T ´ulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh In- ternational Conference on Learning Representations, ICLR

work page
[21]

Test-time classifier ad- justment module for model-agnostic domain generalization

Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier ad- justment module for model-agnostic domain generalization. InNeurIPS, 2021. 3

work page 2021
[22]

Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El-Saddik, and Eric P. Xing. Efficient test-time adaptation of vision-language models. InCVPR 2024, pages 14162– 14171. 3

work page 2024
[23]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023. 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Llava-next: Stronger llms supercharge multimodal capa- bilities in the wild, 2024

Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capa- bilities in the wild, 2024. 2, 4

work page 2024
[25]

Llava-med: Training a large language- and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, 9 and Jianfeng Gao. Llava-med: Training a large language- and-vision assistant for biomedicine in one day. InNeurIPS

work page
[26]

Healthgpt: A medical large vision-language model for unifying comprehension and gen- eration via heterogeneous knowledge adaptation.CoRR, abs/2502.09838, 2025

Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, and Beng Chin Ooi. Healthgpt: A medical large vision-language model for unifying comprehension and gen- eration via heterogeneous knowledge adaptation.CoRR, abs/2502.09838, 2025. 1

work page arXiv 2025
[27]

Mmbench: Is your multi-modal model an all-around player? InECCV , 2024, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV , 2024, pages 216–233. 6, 2

work page 2024
[28]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InNeurIPS,

work page
[29]

Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts. In ICLR, 2024. 2

work page 2024
[30]

An image enhancing pattern-based sparsity for real-time inference on mobile de- vices

Xiaolong Ma, Wei Niu, Tianyun Zhang, Sijia Liu, Sheng Lin, Hongjia Li, Wujie Wen, Xiang Chen, Jian Tang, Kaisheng Ma, Bin Ren, and Yanzhi Wang. An image enhancing pattern-based sparsity for real-time inference on mobile de- vices. InECCV, 2020, pages 629–645. 3

work page 2020
[31]

Shortgpt: Layers in large language mod- els are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingn- ing Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language mod- els are more redundant than you expect. InFindings of the Association for Computational Linguistics, ACL 2025, pages 20192–20204, 2025. 1

work page 2025
[32]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Be- linkov. Locating and editing factual associations in GPT. In NeurIPS 2022, . 3

work page 2022
[33]

Andonian, Yonatan Belinkov, and David Bau

Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a trans- former. InThe Eleventh International Conference on Learn- ing Representations, ICLR 2023, . 3

work page 2023
[34]

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Fast model editing at scale. In The Tenth International Conference on Learning Represen- tations, ICLR 2022, . 3

work page 2022
[35]

Manning, and Chelsea Finn

Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D. Manning, and Chelsea Finn. Memory-based model editing at scale. InInternational Conference on Machine Learning, ICML 2022, pages 15817–15831, . 3

work page 2022
[36]

Compact language models via pruning and knowledge distil- lation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distil- lation. InNeurIPS, pages 41076–41102, 2024. 3

work page 2024
[37]

Controlling text-to-image diffusion by orthogo- nal finetuning

Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Sch¨olkopf. Controlling text-to-image diffusion by orthogo- nal finetuning. InNeurIPS 2023. 7

work page 2023
[38]

Improving robustness against common corruptions by covariate shift adaptation

Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bring- mann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. InNeurIPS 2020. 3

work page 2020
[39]

Test- time prompt tuning for zero-shot generalization in vision- language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test- time prompt tuning for zero-shot generalization in vision- language models. InNeurIPS 2022. 3

work page 2022
[40]

A deeper look at depth pruning of llms.arXiv preprint arXiv:2407.16286, 2024

Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, and Pavlo Molchanov. A deeper look at depth pruning of llms.arXiv preprint arXiv:2407.16286, 2024. 3

work page arXiv 2024
[41]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InECCV 2024, pages 256–

work page 2024
[42]

Transformer- squared: Self-adaptive llms

Qi Sun, Edoardo Cetin, and Yujin Tang. Transformer- squared: Self-adaptive llms. InThe 13th International Con- ference on Learning Representations, ICLR 2025, . 6

work page 2025
[43]

The curse of depth in large language models.arXiv preprint arXiv:2502.05795,

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models.arXiv preprint arXiv:2502.05795, 2025. 1

work page arXiv 2025
[44]

Efros, and Moritz Hardt

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt. Test-time training with self- supervision for generalization under distribution shifts. In Proceedings of the 37th International Conference on Ma- chine Learning, ICML 2020, pages 9229–9248, . 3

work page 2020
[45]

Docllm: A layout-aware genera- tive language model for multimodal document understand- ing

Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nour- bakhsh, and Xiaomo Liu. Docllm: A layout-aware genera- tive language model for multimodal document understand- ing. InProceedings of the 62nd Annual Meeting of the As- sociation for Computational Linguistics, ACL 2024, pages 8529–8548, . 1

work page 2024
[46]

Ol- shausen, and Trevor Darrell

Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno A. Ol- shausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In9th International Conference on Learning Representations, ICLR 2021, . 3

[47] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[48] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In ECCV, pages 409–424, 2018.

[49] Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities. CoRR, abs/2408.07666, 2024.

[50] Xikai Yang, Juzheng Miao, Yuchen Yuan, Jiaze Wang, Qi Dou, Jinpeng Li, and Pheng-Ann Heng. Medical large vision language models with multi-image visual ability. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2025, pages 402–412.

[51] Chenyu Yi, Siyuan Yang, Yufei Wang, Haoliang Li, Yap-Peng Tan, and Alex C. Kot. Temporal coherent test time optimization for robust video classification. In The Eleventh International Conference on Learning Representations, ICLR 2023, 2023.

[52] Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Kumar Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, and Shiwei Liu. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning llms to high sparsity. In Forty-first International Conference on Machine Learning, ICML 2024.

[53] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12), 2024.

[54] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for...

[55] Marvin Zhang, Sergey Levine, and Chelsea Finn. MEMO: test time robustness via adaptation and augmentation. In NeurIPS 2022.

[56] Yang Zhang, Yanfei Dong, and Kenji Kawaguchi. Investigating layer importance in large language models. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 469–479, 2024.

[57] Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Zhiwei Fei, Hui Su, and Xiaoyu Shen. Skipgpt: Dynamic layer pruning reinvented with token awareness and module decoupling. CoRR, abs/2506.04179, 2025.

[58] Kai Zheng, Wei Wu, Rui Feng, et al. Regularized mask tuning: Uncovering hidden knowledge in pre-trained vision-language models. In Proceedings of CVPR 2023, pages 11663–11673.

[59] Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix X. Yu, and Sanjiv Kumar. Modifying memories in transformer models. CoRR, abs/2012.00363, 2020.

Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

Supplementary Material

A. Models and Benchmarks

We present all the mo...

