pith · machine review for the scientific record

arxiv: 2604.24380 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords structural pruning · large vision language models · model compression · knowledge distillation · data efficiency · recovery training · layerwise pruning · widthwise pruning

The pith

Structured pruning of vision-language models, followed by lightweight recovery training, retains over 95 percent of the original performance using only 5 percent of the original training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how to compress large vision-language models by pruning their language model backbones in a structured way, either by removing layers or reducing widths, and then recovering performance through lightweight training. The authors test combinations of supervised fine-tuning and knowledge distillation using logits and hidden states, and find that effective recovery is possible even when using just a small portion of the available data. A key finding is that widthwise pruning performs better than layerwise pruning when resources are limited. This matters because it provides a practical path to deploy powerful multimodal models on devices with constrained memory and compute without needing to retrain from scratch or access massive datasets.
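To make the two pruning paradigms concrete, here is a minimal PyTorch sketch, not the authors' implementation: layerwise pruning drops whole decoder layers ranked by an importance score, while widthwise pruning shrinks the intermediate width of an MLP block. The per-layer importance scores and the magnitude-based neuron criterion are illustrative placeholders.

```python
import torch
import torch.nn as nn

def prune_layerwise(layers: nn.ModuleList, importance: torch.Tensor, n_keep: int) -> nn.ModuleList:
    """Layerwise pruning: keep the n_keep most important decoder layers, in their
    original order. `importance` holds one placeholder score per layer."""
    keep = torch.topk(importance, n_keep).indices.sort().values
    return nn.ModuleList([layers[int(i)] for i in keep])

def prune_mlp_widthwise(fc_in: nn.Linear, fc_out: nn.Linear, n_keep: int):
    """Widthwise pruning of one MLP block: keep the n_keep intermediate neurons
    with the largest input-weight magnitude (an illustrative criterion) and
    rebuild narrower linear layers around them."""
    scores = fc_in.weight.abs().sum(dim=1)  # one score per intermediate neuron
    keep = torch.topk(scores, n_keep).indices.sort().values
    new_in = nn.Linear(fc_in.in_features, n_keep, bias=fc_in.bias is not None)
    new_out = nn.Linear(n_keep, fc_out.out_features, bias=fc_out.bias is not None)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[keep])
        if fc_in.bias is not None:
            new_in.bias.copy_(fc_in.bias[keep])
        new_out.weight.copy_(fc_out.weight[:, keep])
        if fc_out.bias is not None:
            new_out.bias.copy_(fc_out.bias)
    return new_in, new_out
```

Either operation touches only the language-model backbone; the vision encoder and projector are left intact, which is what keeps the subsequent recovery step lightweight.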

Core claim

The central discovery is that after applying layerwise or widthwise structural pruning to the language model component of LVLMs, a combination of supervised finetuning and hidden-state distillation can recover most of the original performance, and that this recovery succeeds using only 5% of the original data while retaining over 95% of performance. Widthwise pruning maintains better performance in low-resource scenarios, and at small compression levels, finetuning only the multimodal projector is sufficient.

What carries the argument

Structured pruning of the language model backbone (layerwise or widthwise) paired with recovery training via supervised finetuning and knowledge distillation on logits and hidden states.
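A minimal sketch of what such a recovery objective could look like, assuming the pruned student and the frozen original model both expose per-token logits and one matched pair of hidden states; the loss weights, temperature, and choice of matched layer are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def recovery_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                  labels, proj, alpha=1.0, beta=1.0, gamma=1.0, tau=2.0):
    """Supervised finetuning plus distillation on logits and hidden states.
    `proj` is a small linear map from the student's hidden size to the
    teacher's, needed only when widthwise pruning changed the hidden size."""
    # Supervised finetuning: cross-entropy against the ground-truth tokens.
    sft = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # Logit distillation: temperature-scaled KL to the frozen teacher.
    kd_logits = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                         F.softmax(teacher_logits / tau, dim=-1),
                         reduction="batchmean") * tau ** 2
    # Hidden-state distillation: match intermediate representations.
    kd_hidden = F.mse_loss(proj(student_hidden), teacher_hidden)
    return alpha * sft + beta * kd_logits + gamma * kd_hidden
```

The recipe the paper reports as optimal corresponds to combining the supervised term with the hidden-state term; keeping all three weights in the sketch lets the variants be ablated by zeroing a coefficient.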

If this is right

  • Widthwise pruning is more robust than layerwise pruning when finetuning data or compute is scarce.
  • For mild pruning ratios, updating only the multimodal projector during recovery is enough to restore performance (a sketch of this recipe follows the list).
  • The optimal recovery strategy combines supervised finetuning with distillation of hidden states.
  • High performance retention is achievable with recovery training on as little as 5% of the original dataset.
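The projector-only recipe from the second bullet amounts to freezing everything except the multimodal projector before recovery training. A hedged sketch, assuming a LLaVA-style model whose projector parameters contain "mm_projector" in their names (a hypothetical attribute path; adjust to the actual module name):

```python
import torch

def freeze_all_but_projector(model: torch.nn.Module, projector_key: str = "mm_projector"):
    """Freeze every parameter whose name does not contain `projector_key`,
    then return the remaining trainable parameters for the optimizer."""
    for name, param in model.named_parameters():
        param.requires_grad = projector_key in name
    return [p for p in model.parameters() if p.requires_grad]

# Illustrative usage:
# optimizer = torch.optim.AdamW(freeze_all_but_projector(model), lr=2e-5)
```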

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach could reduce the need for training smaller LVLMs from scratch by instead compressing larger pretrained ones.
  • Similar pruning and recovery techniques might apply to other large multimodal architectures beyond the tested 3B-7B range.
  • Developers could use this to iteratively prune and recover models to find optimal compression levels without full retraining.

Load-bearing premise

The load-bearing premise is that the pruning dynamics, optimal recovery methods, and data efficiency observed on the three tested LVLM families and benchmarks will hold for other model sizes, tasks, and deployment environments.

What would settle it

Observing that recovery training with 5% data on a new LVLM family or a different multimodal benchmark results in performance retention below 90% of the original would falsify the data-efficiency claim.

original abstract

While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper examines structural pruning of LVLMs (3B-7B scale) via layerwise and widthwise methods on the language backbone, followed by recovery via supervised finetuning and logit/hidden-state distillation. It reports that widthwise pruning is more robust under low compute/data, that projector-only finetuning suffices for mild compression, and that recovery training on only 5% of data can retain >95% of original performance across three LVLM families, with code released.

Significance. If the empirical findings hold with proper controls, the work supplies actionable, data-efficient recipes for compressing existing LVLMs without full retraining, which is directly relevant to edge deployment. The multi-family evaluation and open code are strengths that increase the potential utility for practitioners.

major comments (2)
  1. Abstract and results on data efficiency: the central claim that 'effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance' is presented without reported standard deviation across multiple random 5% subsets, different seeds, or repeated subsampling. In low-data regimes this omission is load-bearing, as subset choice can materially affect retention; the manuscript must either add such variance statistics or qualify the claim.
  2. Experimental setup (throughout results sections): the abstract and summary report concrete recovery percentages and method rankings, yet the provided text lacks explicit statements on number of runs, variance across seeds, full baseline comparisons (e.g., unstructured pruning, other distillation variants), and hyperparameter search details for the recovery schedules. These omissions prevent full verification of the performance claims and rankings.
minor comments (3)
  1. Clarify in the methods section how the 5% data subset is sampled (random, stratified, or fixed) and whether the same subset is used across all pruning ratios and models.
  2. Add a table or figure caption that explicitly lists the exact benchmarks, metrics, and original (unpruned) scores for each LVLM family so that the '95% retention' figures can be directly cross-checked.
  3. The distinction between 'layerwise' and 'widthwise' pruning should be illustrated with a small diagram or pseudocode in §3 to avoid ambiguity for readers unfamiliar with the exact structural cuts.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. We address each major comment below and outline the revisions we plan to make.

point-by-point responses
  1. Referee: Abstract and results on data efficiency: the central claim that 'effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance' is presented without reported standard deviation across multiple random 5% subsets, different seeds, or repeated subsampling. In low-data regimes this omission is load-bearing, as subset choice can materially affect retention; the manuscript must either add such variance statistics or qualify the claim.

    Authors: We agree that the absence of variance statistics for the 5% data subset experiments is a limitation, particularly in low-data regimes where subset selection can influence results. Our current experiments used a single random 5% subset for each model family. In the revised manuscript, we will perform additional experiments with multiple random subsets and different seeds, reporting mean performance and standard deviations (a sketch of such a check appears after these responses). This will either support or qualify the central claim regarding data efficiency. revision: yes

  2. Referee: Experimental setup (throughout results sections): the abstract and summary report concrete recovery percentages and method rankings, yet the provided text lacks explicit statements on number of runs, variance across seeds, full baseline comparisons (e.g., unstructured pruning, other distillation variants), and hyperparameter search details for the recovery schedules. These omissions prevent full verification of the performance claims and rankings.

    Authors: We acknowledge that more explicit details on the experimental setup would enhance reproducibility and verifiability. The manuscript describes the pruning methods, recovery strategies (supervised finetuning, logit and hidden-state distillation), and evaluations across three LVLM families. However, we did not report the number of runs explicitly (experiments were typically run once per configuration due to computational constraints), nor did we include unstructured pruning baselines or exhaustive hyperparameter sweeps. We will revise the experimental setup section to include these details where available, specify the number of runs, add variance if multiple seeds were tested, and provide hyperparameter information. Regarding additional baselines, we will consider including unstructured pruning comparisons if space permits, but our primary focus was on structured pruning as it is more suitable for deployment. revision: partial
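The variance check promised in response 1 could be scripted roughly as follows. `run_recovery_and_eval` is a hypothetical stand-in for the paper's prune-then-recover pipeline, and the 5% fraction and seed list are illustrative.

```python
import random
import statistics

def retention_over_subsets(train_examples, run_recovery_and_eval, original_score,
                           fraction=0.05, seeds=(0, 1, 2, 3, 4)):
    """Repeat recovery training on independent random subsets and report the
    mean and standard deviation of performance retention relative to the
    unpruned model's score on the same benchmark."""
    retentions = []
    for seed in seeds:
        rng = random.Random(seed)
        subset = rng.sample(train_examples, k=int(len(train_examples) * fraction))
        score = run_recovery_and_eval(subset, seed=seed)
        retentions.append(score / original_score)
    return statistics.mean(retentions), statistics.stdev(retentions)
```

Reporting the resulting mean and standard deviation alongside the single-subset numbers would directly address major comment 1.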

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations

full rationale

The manuscript is an empirical study of pruning and recovery on three LVLM families. It reports experimental outcomes from pruning (layerwise/widthwise), recovery via finetuning/distillation, and data-efficiency tests, with all numbers obtained from held-out evaluations after the interventions. There are no equations or first-principles derivations, no load-bearing claim reduces to a fitted parameter or a self-citation by construction, and the 5% data result is a direct experimental measurement rather than a renamed fit or an imported uniqueness theorem. The work is therefore grounded in evaluations against external benchmarks rather than in its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions of deep learning rather than new theoretical constructs. No free parameters are introduced as fitted constants; pruning ratios and learning rates are treated as experimental choices.

axioms (1)
  • domain assumption: Structured pruning of the language-model backbone preserves enough multimodal capability that lightweight recovery training can restore most performance.
    Invoked when the authors choose to prune only the LM component and then apply recovery rather than retraining the entire model.

pith-pipeline@v0.9.0 · 5586 in / 1198 out tokens · 43481 ms · 2026-05-08T03:43:55.509068+00:00 · methodology

