Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models
Pith reviewed 2026-05-21 14:43 UTC · model grok-4.3
The pith
Certain layers in vision-language models interfere with specific tasks, and bypassing them improves performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In pretrained VLMs, some layers act as task-interfering layers that reduce performance on downstream tasks. By measuring the change in performance relative to the base model after intervening on each layer, the authors find that bypassing certain layers improves results, and that the effect holds across models and datasets. These interfering layers display task-specific patterns: tasks that require similar capabilities have highly similar task-layer interaction vectors. TaLo (Task-Adaptive Layer Knockout) exploits this by dynamically identifying and bypassing the most interfering layer for a given task, without any training, and improves Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%.
What carries the argument
Task-Layer Interaction Vector, which quantifies the impact of intervening on each layer for a particular task by tracking performance changes.
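A minimal sketch of how such a vector could be assembled, assuming the task score is an accuracy and that the caller supplies an evaluation callback. The names `interaction_vector`, `evaluate`, and `toy_evaluate` are hypothetical and the scores are made up for illustration; none of them come from the paper.

```python
from typing import Callable, List, Optional


def interaction_vector(
    evaluate: Callable[[str, Optional[int]], float],
    task: str,
    num_layers: int,
) -> List[float]:
    """Per-layer change in task score after intervening on that layer.

    `evaluate(task, layer)` is a hypothetical callback: it returns the task
    score with `layer` knocked out, or the base score when `layer` is None.
    """
    base = evaluate(task, None)
    # Positive entry: knocking the layer out helped (the layer interferes).
    # Negative entry: knocking the layer out hurt (the layer contributes).
    return [evaluate(task, layer) - base for layer in range(num_layers)]


# Toy stand-in for a real evaluation loop; these scores are illustrative only.
_TOY_SCORES = {None: 0.50, 0: 0.20, 1: 0.50, 2: 0.58, 3: 0.45}


def toy_evaluate(task: str, layer: Optional[int]) -> float:
    return _TOY_SCORES[layer]


vector = interaction_vector(toy_evaluate, "maps", num_layers=4)
```

Under this sign convention, a layer counts as task-interfering when its entry is positive.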
Load-bearing premise
Zeroing out a layer's parameters accurately isolates its interfering effect without causing other unintended changes in how the model computes outputs.
What would settle it
An experiment where randomly bypassing layers yields similar or better improvements than targeting the identified interfering ones, or where zeroing layers fails to improve performance when confounding factors are controlled.
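The random-bypass control could be sketched as follows, assuming per-layer score changes are available on two disjoint splits: a probe split used to pick the layer and a held-out split used to judge it. The function name, the split design, and the numbers are assumptions made for illustration, not the paper's protocol.

```python
from statistics import mean
from typing import Dict, List


def targeted_vs_random(
    probe_deltas: List[float], heldout_deltas: List[float]
) -> Dict[str, float]:
    """Compare the layer chosen on the probe split with a random-layer baseline."""
    if len(probe_deltas) != len(heldout_deltas):
        raise ValueError("both splits must cover the same layers")
    n = len(probe_deltas)
    # Pick the layer on the probe split only, so the held-out gain is unbiased.
    target = max(range(n), key=lambda i: probe_deltas[i])
    targeted_gain = heldout_deltas[target]
    # Expected gain from knocking out one layer chosen uniformly at random.
    random_gain = mean(heldout_deltas)
    # Share of layers that do at least as well as the targeted one.
    at_least_as_good = sum(d >= targeted_gain for d in heldout_deltas) / n
    return {
        "target_layer": float(target),
        "targeted_gain": targeted_gain,
        "random_gain": random_gain,
        "share_at_least_as_good": at_least_as_good,
    }


# Illustrative numbers only.
result = targeted_vs_random(
    probe_deltas=[-0.30, 0.00, 0.08, -0.05],
    heldout_deltas=[-0.28, 0.01, 0.06, -0.04],
)
```

If the targeted gain were no better than the random-layer gain on held-out data, the claim that specific layers interfere would be weakened.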
Original abstract
Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement can be generalizable across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks' performance. We introduce Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that pretrained vision-language models contain task-interfering layers whose removal via intervention (e.g., zeroing parameters) can improve downstream task performance. It introduces the Task-Layer Interaction Vector to quantify per-layer effects, observes consistent patterns across similar tasks, and proposes the training-free TaLo method that dynamically bypasses the most interfering layer at test time, reporting gains such as 16.6% on the ScienceQA Maps task for Qwen-VL.
Significance. If the layer-intervention results prove robust to alternative bypass mechanisms, the work would usefully demonstrate unexpected modularity in VLMs and supply a simple plug-and-play inference-time adaptation technique. The direct empirical measurements and cross-model/dataset observations are strengths; the introduction of the interaction vector provides a concrete, falsifiable way to characterize layer-task relationships.
major comments (2)
- [Layer intervention experiments (Section 3)] The identification of task-interfering layers rests on zeroing layer parameters as the primary intervention. This proxy can rescale downstream activations, interact with layer norms, and alter residual dynamics in ways that differ from a true layer bypass (e.g., a skip connection or attention masking). Without a side-by-side comparison of zeroing versus explicit bypass on the same layers and tasks, the performance gains (including the 16.6% figure) cannot be confidently attributed to removal of interference rather than intervention side-effects.
- [Results and TaLo evaluation (Section 4)] The generalizability claim across models and datasets requires fuller reporting of all layer-intervention outcomes, including negative or neutral results, together with statistical controls for multiple testing. Selective highlighting of improvements risks overstating the prevalence and reliability of task-interfering layers.
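The gap between zeroing and a true bypass can be seen in a toy residual block. The sketch below is not the paper's architecture: it assumes a single block of the form `y = x + W2·relu(W1·x + b1) + b2` with made-up weights, and all names are hypothetical. It shows that zeroing every parameter reproduces the bypass, while zeroing only the weight matrices leaves a constant offset from the surviving biases.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def residual_block(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Toy block: y = x + W2 * relu(W1 * x + b1) + b2."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(w1, x), b1)]
    update = [u + b for u, b in zip(matvec(w2, hidden), b2)]
    return [xi + ui for xi, ui in zip(x, update)]


def zeros_like(m: Matrix) -> Matrix:
    return [[0.0] * len(row) for row in m]


# Made-up input and parameters, for illustration only.
x = [1.0, -2.0]
w1 = [[0.5, -1.0], [1.5, 0.25]]
b1 = [0.1, -0.2]
w2 = [[1.0, 0.0], [-0.5, 2.0]]
b2 = [0.3, -0.4]

# True bypass: the block is skipped and the input passes through unchanged.
bypass = list(x)
# Zeroing weights and biases: the residual update vanishes, matching the bypass.
zero_all = residual_block(x, zeros_like(w1), [0.0, 0.0], zeros_like(w2), [0.0, 0.0])
# Zeroing weight matrices only: the output bias b2 still shifts the stream.
zero_weights_only = residual_block(x, zeros_like(w1), b1, zeros_like(w2), b2)
```

Which of these cases applies to the reported experiments depends on exactly which parameters were zeroed (biases, normalization gains) and on where the normalization sits relative to the residual connection, which is why the side-by-side comparison matters.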
minor comments (2)
- [Method (Section 3.1)] The definition and exact computation of the Task-Layer Interaction Vector should be stated with an equation or pseudocode to allow replication.
- [Figures 4-5] Figure captions and axis labels for the similarity matrices of task-layer vectors need clearer annotation to make the claimed high similarity between related tasks immediately visible.
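One plausible form of the requested equation is given here as a sketch under assumed notation that is not taken from the paper: $M$ is the base model with $L$ layers, $M_{\setminus l}$ is the model with layer $l$ intervened on, and $\mathrm{Acc}(\cdot\,; T)$ is the score on task $T$. The use of cosine similarity is also an assumption; the abstract says only that related tasks show "high similarity".

```latex
\[
  \mathbf{v}_T = \bigl(\Delta_1^{T}, \dots, \Delta_L^{T}\bigr),
  \qquad
  \Delta_l^{T} = \mathrm{Acc}\bigl(M_{\setminus l};\, T\bigr) - \mathrm{Acc}\bigl(M;\, T\bigr)
\]
\[
  \mathrm{sim}(T, T') =
  \frac{\mathbf{v}_T \cdot \mathbf{v}_{T'}}
       {\lVert \mathbf{v}_T \rVert \,\lVert \mathbf{v}_{T'} \rVert}
\]
```

With this sign convention, $\Delta_l^{T} > 0$ marks layer $l$ as interfering with task $T$.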
Simulated Author's Rebuttal
Thank you for your thorough review and valuable suggestions. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to the manuscript.
- Referee: [Layer intervention experiments (Section 3)] The identification of task-interfering layers rests on zeroing layer parameters as the primary intervention. This proxy can rescale downstream activations, interact with layer norms, and alter residual dynamics in ways that differ from a true layer bypass (e.g., a skip connection or attention masking). Without a side-by-side comparison of zeroing versus explicit bypass on the same layers and tasks, the performance gains (including the 16.6% figure) cannot be confidently attributed to removal of interference rather than intervention side-effects.
  Authors: We appreciate this important distinction between zeroing and a pure architectural bypass. Zeroing was chosen as a direct ablation to nullify a layer's contribution, following standard practices in neural network interpretability. We acknowledge that side effects on norms and residuals could contribute to observed gains. To isolate the effect, we will add side-by-side experiments in the revised manuscript comparing zeroing against explicit skip connections and attention masking on the same layers and tasks, including the ScienceQA Maps example.
  Revision: yes
- Referee: [Results and TaLo evaluation (Section 4)] The generalizability claim across models and datasets requires fuller reporting of all layer-intervention outcomes, including negative or neutral results, together with statistical controls for multiple testing. Selective highlighting of improvements risks overstating the prevalence and reliability of task-interfering layers.
  Authors: We agree that complete reporting strengthens the claims. The manuscript already covers multiple models and datasets with some neutral outcomes noted, but we will expand Section 4 and the appendix to include a full table of all layer-intervention results (positive, neutral, and negative) across experiments. We will also add statistical controls such as corrected p-values or confidence intervals to account for multiple testing.
  Revision: yes
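As an illustration of what "corrected p-values" could look like when every layer-task pair is tested, the sketch below applies the Holm step-down correction. The choice of Holm is an assumption (the rebuttal does not name a method), and the raw p-values are made up; in practice they might come from a paired test between base and intervened predictions.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, one per layer intervention on a single task.
raw = [0.001, 0.04, 0.03, 0.20]
adjusted = holm_adjust(raw)
significant = [p <= 0.05 for p in adjusted]
```

In this toy case three raw p-values fall below 0.05, but only one layer remains significant after correction, which is the kind of shrinkage the referee's comment anticipates.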
Circularity Check
No significant circularity: empirical interventions and direct measurements
full rationale
The paper's core claims rest on direct empirical measurements: zeroing individual layer parameters, recording performance deltas on downstream tasks, and defining the Task-Layer Interaction Vector from those observed changes. TaLo is then a simple selection rule that picks, for a given task, the layer whose knockout yields the largest performance gain. No equations reduce a claimed prediction to a fitted input by construction, no self-citations bear the central premise, and no uniqueness theorems or ansatzes are imported from prior author work. The results are presented as falsifiable experimental outcomes across multiple models and datasets rather than derived quantities, satisfying the criteria for a self-contained empirical study.
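The selection rule can be sketched in a few lines, assuming each delta is the score with the layer knocked out minus the base score, so that the most interfering layer has the largest positive delta. The function name and the `min_gain` threshold are hypothetical additions for illustration; how the paper gathers the per-task deltas at test time is not described here.

```python
from typing import List, Optional


def select_knockout_layer(deltas: List[float], min_gain: float = 0.0) -> Optional[int]:
    """Index of the layer whose knockout helped most, or None if none helped.

    `min_gain` is a hypothetical guard: keep the full model unless some
    layer's knockout improves the score by more than this amount.
    """
    if not deltas:
        return None
    best = max(range(len(deltas)), key=lambda i: deltas[i])
    return best if deltas[best] > min_gain else None


# Illustrative deltas only.
chosen = select_knockout_layer([-0.30, 0.00, 0.08, -0.05])
no_choice = select_knockout_layer([-0.10, 0.00])
```

Returning `None` when no knockout helps keeps the rule from degrading tasks on which every layer contributes.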
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Intervening on a layer by zeroing its parameters reveals whether that layer helps or interferes with a given task.
invented entities (1)
- Task-Layer Interaction Vector: no independent evidence