pith · machine review for the scientific record

arxiv: 2604.24380 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords structural pruning · large vision language models · model compression · knowledge distillation · data efficiency · recovery training · layerwise pruning · widthwise pruning

The pith

Structured pruning of vision-language models, followed by lightweight recovery training, retains over 95 percent of the original performance using only 5 percent of the original training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how to compress large vision-language models by pruning their language model backbones in a structured way, either by removing layers or reducing widths, and then recovering performance through lightweight training. The authors test combinations of supervised fine-tuning and knowledge distillation using logits and hidden states, and find that effective recovery is possible even when using just a small portion of the available data. A key finding is that widthwise pruning performs better than layerwise pruning when resources are limited. This matters because it provides a practical path to deploy powerful multimodal models on devices with constrained memory and compute without needing to retrain from scratch or access massive datasets.
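To make the two pruning paradigms concrete, here is a minimal PyTorch sketch, not the authors' implementation: layerwise pruning drops whole decoder layers ranked by an importance score, while widthwise pruning shrinks the intermediate width of an MLP block. The per-layer importance scores and the magnitude-based neuron criterion are illustrative placeholders.

```python
import torch
import torch.nn as nn

def prune_layerwise(layers: nn.ModuleList, importance: torch.Tensor, n_keep: int) -> nn.ModuleList:
    """Layerwise pruning: keep the n_keep most important decoder layers, in their
    original order. `importance` holds one placeholder score per layer."""
    keep = torch.topk(importance, n_keep).indices.sort().values
    return nn.ModuleList([layers[int(i)] for i in keep])

def prune_mlp_widthwise(fc_in: nn.Linear, fc_out: nn.Linear, n_keep: int):
    """Widthwise pruning of one MLP block: keep the n_keep intermediate neurons
    with the largest input-weight magnitude (an illustrative criterion) and
    rebuild narrower linear layers around them."""
    scores = fc_in.weight.abs().sum(dim=1)  # one score per intermediate neuron
    keep = torch.topk(scores, n_keep).indices.sort().values
    new_in = nn.Linear(fc_in.in_features, n_keep, bias=fc_in.bias is not None)
    new_out = nn.Linear(n_keep, fc_out.out_features, bias=fc_out.bias is not None)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[keep])
        if fc_in.bias is not None:
            new_in.bias.copy_(fc_in.bias[keep])
        new_out.weight.copy_(fc_out.weight[:, keep])
        if fc_out.bias is not None:
            new_out.bias.copy_(fc_out.bias)
    return new_in, new_out
```

Either operation touches only the language-model backbone; the vision encoder and projector are left intact, which is what keeps the subsequent recovery step lightweight.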

Core claim

The central discovery is that after applying layerwise or widthwise structural pruning to the language model component of LVLMs, a combination of supervised finetuning and hidden-state distillation can recover most of the original performance, and that this recovery succeeds using only 5% of the original data while retaining over 95% of performance. Widthwise pruning maintains better performance in low-resource scenarios, and at small compression levels, finetuning only the multimodal projector is sufficient.

What carries the argument

Structured pruning of the language model backbone (layerwise or widthwise) paired with recovery training via supervised finetuning and knowledge distillation on logits and hidden states.
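A minimal sketch of what such a recovery objective could look like, assuming the pruned student and the frozen original model both expose per-token logits and one matched pair of hidden states; the loss weights, temperature, and choice of matched layer are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def recovery_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                  labels, proj, alpha=1.0, beta=1.0, gamma=1.0, tau=2.0):
    """Supervised finetuning plus distillation on logits and hidden states.
    `proj` is a small linear map from the student's hidden size to the
    teacher's, needed only when widthwise pruning changed the hidden size."""
    # Supervised finetuning: cross-entropy against the ground-truth tokens.
    sft = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # Logit distillation: temperature-scaled KL to the frozen teacher.
    kd_logits = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                         F.softmax(teacher_logits / tau, dim=-1),
                         reduction="batchmean") * tau ** 2
    # Hidden-state distillation: match intermediate representations.
    kd_hidden = F.mse_loss(proj(student_hidden), teacher_hidden)
    return alpha * sft + beta * kd_logits + gamma * kd_hidden
```

The recipe the paper reports as optimal corresponds to combining the supervised term with the hidden-state term; keeping all three weights in the sketch lets the variants be ablated by zeroing a coefficient.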

If this is right

  • Widthwise pruning is more robust than layerwise pruning when finetuning data or compute is scarce.
  • For mild pruning ratios, updating only the multimodal projector during recovery is enough to restore performance (a sketch of this recipe follows the list).
  • The optimal recovery strategy combines supervised finetuning with distillation of hidden states.
  • High performance retention is achievable with recovery training on as little as 5% of the original dataset.
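The projector-only recipe from the second bullet amounts to freezing everything except the multimodal projector before recovery training. A hedged sketch, assuming a LLaVA-style model whose projector parameters contain "mm_projector" in their names (a hypothetical attribute path; adjust to the actual module name):

```python
import torch

def freeze_all_but_projector(model: torch.nn.Module, projector_key: str = "mm_projector"):
    """Freeze every parameter whose name does not contain `projector_key`,
    then return the remaining trainable parameters for the optimizer."""
    for name, param in model.named_parameters():
        param.requires_grad = projector_key in name
    return [p for p in model.parameters() if p.requires_grad]

# Illustrative usage:
# optimizer = torch.optim.AdamW(freeze_all_but_projector(model), lr=2e-5)
```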

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach could reduce the need for training smaller LVLMs from scratch by instead compressing larger pretrained ones.
  • Similar pruning and recovery techniques might apply to other large multimodal architectures beyond the tested 3B-7B range.
  • Developers could use this to iteratively prune and recover models to find optimal compression levels without full retraining.

Load-bearing premise

The load-bearing premise is that the pruning dynamics, optimal recovery methods, and data efficiency observed on the three tested LVLM families and benchmarks will hold for other model sizes, tasks, and deployment environments.

What would settle it

Observing that recovery training with 5% data on a new LVLM family or a different multimodal benchmark results in performance retention below 90% of the original would falsify the data-efficiency claim.

original abstract

While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper examines structural pruning of LVLMs (3B-7B scale) via layerwise and widthwise methods on the language backbone, followed by recovery via supervised finetuning and logit/hidden-state distillation. It reports that widthwise pruning is more robust under low compute/data, that projector-only finetuning suffices for mild compression, and that recovery training on only 5% of data can retain >95% of original performance across three LVLM families, with code released.

Significance. If the empirical findings hold with proper controls, the work supplies actionable, data-efficient recipes for compressing existing LVLMs without full retraining, which is directly relevant to edge deployment. The multi-family evaluation and open code are strengths that increase the potential utility for practitioners.

major comments (2)
  1. Abstract and results on data efficiency: the central claim that 'effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance' is presented without reported standard deviation across multiple random 5% subsets, different seeds, or repeated subsampling. In low-data regimes this omission is load-bearing, as subset choice can materially affect retention; the manuscript must either add such variance statistics or qualify the claim.
  2. Experimental setup (throughout results sections): the abstract and summary report concrete recovery percentages and method rankings, yet the provided text lacks explicit statements on number of runs, variance across seeds, full baseline comparisons (e.g., unstructured pruning, other distillation variants), and hyperparameter search details for the recovery schedules. These omissions prevent full verification of the performance claims and rankings.
minor comments (3)
  1. Clarify in the methods section how the 5% data subset is sampled (random, stratified, or fixed) and whether the same subset is used across all pruning ratios and models.
  2. Add a table or figure caption that explicitly lists the exact benchmarks, metrics, and original (unpruned) scores for each LVLM family so that the '95% retention' figures can be directly cross-checked.
  3. The distinction between 'layerwise' and 'widthwise' pruning should be illustrated with a small diagram or pseudocode in §3 to avoid ambiguity for readers unfamiliar with the exact structural cuts.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. We address each major comment below and outline the revisions we plan to make.

point-by-point responses
  1. Referee: Abstract and results on data efficiency: the central claim that 'effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance' is presented without reported standard deviation across multiple random 5% subsets, different seeds, or repeated subsampling. In low-data regimes this omission is load-bearing, as subset choice can materially affect retention; the manuscript must either add such variance statistics or qualify the claim.

    Authors: We agree that the absence of variance statistics for the 5% data subset experiments is a limitation, particularly in low-data regimes where subset selection can influence results. Our current experiments used a single random 5% subset for each model family. In the revised manuscript, we will perform additional experiments with multiple random subsets and different seeds, reporting mean performance and standard deviations (a sketch of such a check appears after these responses). This will either support or qualify the central claim regarding data efficiency. revision: yes

  2. Referee: Experimental setup (throughout results sections): the abstract and summary report concrete recovery percentages and method rankings, yet the provided text lacks explicit statements on number of runs, variance across seeds, full baseline comparisons (e.g., unstructured pruning, other distillation variants), and hyperparameter search details for the recovery schedules. These omissions prevent full verification of the performance claims and rankings.

    Authors: We acknowledge that more explicit details on the experimental setup would enhance reproducibility and verifiability. The manuscript describes the pruning methods, recovery strategies (supervised finetuning, logit and hidden-state distillation), and evaluations across three LVLM families. However, we did not report the number of runs explicitly (experiments were typically run once per configuration due to computational constraints), nor did we include unstructured pruning baselines or exhaustive hyperparameter sweeps. We will revise the experimental setup section to include these details where available, specify the number of runs, add variance if multiple seeds were tested, and provide hyperparameter information. Regarding additional baselines, we will consider including unstructured pruning comparisons if space permits, but our primary focus was on structured pruning as it is more suitable for deployment. revision: partial
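The variance check promised in response 1 could be scripted roughly as follows. `run_recovery_and_eval` is a hypothetical stand-in for the paper's prune-then-recover pipeline, and the 5% fraction and seed list are illustrative.

```python
import random
import statistics

def retention_over_subsets(train_examples, run_recovery_and_eval, original_score,
                           fraction=0.05, seeds=(0, 1, 2, 3, 4)):
    """Repeat recovery training on independent random subsets and report the
    mean and standard deviation of performance retention relative to the
    unpruned model's score on the same benchmark."""
    retentions = []
    for seed in seeds:
        rng = random.Random(seed)
        subset = rng.sample(train_examples, k=int(len(train_examples) * fraction))
        score = run_recovery_and_eval(subset, seed=seed)
        retentions.append(score / original_score)
    return statistics.mean(retentions), statistics.stdev(retentions)
```

Reporting the resulting mean and standard deviation alongside the single-subset numbers would directly address major comment 1.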

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations

full rationale

The manuscript is an empirical study of pruning and recovery on three LVLM families. It reports experimental outcomes from pruning (layerwise/widthwise), recovery via finetuning/distillation, and data-efficiency tests, with all numbers obtained from held-out evaluations after the interventions. There are no equations or first-principles derivations, no load-bearing claim reduces to a fitted parameter or a self-citation by construction, and the 5% data result is a direct experimental measurement rather than a renamed fit or an imported uniqueness theorem. The work is therefore grounded in evaluations against external benchmarks rather than in its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions of deep learning rather than new theoretical constructs. No free parameters are introduced as fitted constants; pruning ratios and learning rates are treated as experimental choices.

axioms (1)
  • domain assumption: Structured pruning of the language-model backbone preserves enough multimodal capability that lightweight recovery training can restore most performance.
    Invoked when the authors choose to prune only the LM component and then apply recovery rather than retraining the entire model.

pith-pipeline@v0.9.0 · 5586 in / 1198 out tokens · 43481 ms · 2026-05-08T03:43:55.509068+00:00 · methodology

