MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

Aritra Dutta; Somak Aditya

arxiv: 2605.25842 · v2 · pith:5NRPO766new · submitted 2026-05-25 · 💻 cs.AI · cs.CL


Pith reviewed 2026-06-29 21:10 UTC · model grok-4.3


keywords: structured pruning, chain-of-thought reasoning, vision-language models, model compression, multimodal reasoning, parameter pruning, reasoning preservation


The pith

MuCRASP prunes vision-language models by targeting CoT pivot tokens and modality-specific activations to preserve reasoning accuracy under compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard structured pruning removes parameters without regard for the sparse transition points that sustain chain-of-thought consistency in multimodal generation. It further claims that visual and textual activations follow different distributions, so pruning rules derived from text-only models damage cross-modal alignment. MuCRASP therefore selects which components to remove by combining reasoning-critical identification, cross-modal preservation, and layer-wise sensitivity within one global parameter budget. If this selection rule works, VLMs could be made substantially smaller while still solving physical-reasoning and other step-by-step multimodal tasks at close to their original quality.

Core claim

MuCRASP is a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget; experiments on four VLMs across three reasoning benchmarks show it consistently preserves reasoning quality under increasing compression, maintains high reasoning consistency up to 50 percent pruning, and exhibits lower perplexity degradation than prior methods.

What carries the argument

MuCRASP, the structured pruning framework that identifies and protects reasoning-critical components by locating pivot tokens in CoT trajectories and respecting activation-distribution differences between visual and textual modalities under a single global budget.

If this is right

Reasoning consistency remains high even after 50 percent of parameters are removed.
Perplexity rises more slowly than under CoT-agnostic pruning.
The same global budget can be allocated across layers according to measured sensitivity rather than uniform ratios.
Cross-modal alignment is retained because pruning decisions explicitly consider both visual and textual activation statistics.
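The budgeted selection described above can be illustrated with a small sketch. This is not the paper's algorithm: the `Unit` record, the `protection` multipliers, and the score-per-parameter ordering are assumptions made for illustration, loosely following the Figure 1 caption (importance scores, a per-layer protection factor, a greedy knapsack). It shows how one global budget can remove more from low-sensitivity layers than from protected ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A prunable structure (e.g. an attention head or a block of MLP neurons)."""
    layer: int
    name: str
    params: int        # parameter count removed if this unit is pruned
    importance: float  # hypothetical fused importance score (higher = keep)


def select_units_to_prune(units, protection, prune_ratio):
    """Greedily pick units to remove under one global parameter budget.

    protection maps layer index -> multiplier (>1 shields that layer).
    Units are visited from lowest protected importance per parameter upward,
    and a unit is removed only if it still fits in the remaining budget.
    """
    total = sum(u.params for u in units)
    budget = prune_ratio * total

    def density(u):
        return protection.get(u.layer, 1.0) * u.importance / u.params

    removed, removed_params = [], 0
    for u in sorted(units, key=density):
        if removed_params + u.params <= budget:
            removed.append(u)
            removed_params += u.params
    return removed, removed_params / total


# Toy example: layer 1 is treated as a cross-modal integration layer.
units = [
    Unit(0, "head0", 100, 0.20),
    Unit(0, "mlp0", 200, 0.10),
    Unit(1, "head1", 100, 0.15),
    Unit(1, "mlp1", 200, 0.30),
]
shielded, achieved = select_units_to_prune(units, {1: 4.0}, 0.5)
unshielded, _ = select_units_to_prune(units, {}, 0.5)
print([u.name for u in shielded], achieved)
print([u.name for u in unshielded])
```

In the toy data, shielding layer 1 moves all removals into layer 0, whereas the unshielded run also removes a layer-1 head. The per-layer ratios therefore fall out of the scores rather than being fixed in advance.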

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pivot-token logic could be applied to prune models that generate long chains of tool calls or multi-step plans in non-visual domains.
If pivot tokens prove stable across tasks, a one-time identification pass might suffice for repeated pruning at different compression targets.
Layer-wise sensitivity maps computed once on a calibration set could guide pruning for families of related VLMs without re-running the full search.

Load-bearing premise

CoT consistency in VLMs depends on sparse pivot tokens whose removal by existing pruning methods destroys accuracy, and visual versus textual activations differ enough that unimodal pruning rules cannot handle them.

What would settle it

A side-by-side evaluation on the same physical-reasoning benchmark and the same Qwen2.5-VL-7B model at 30 percent pruning in which MuCRASP does not produce a higher LLM-as-a-Judge score than the strongest baseline.

Figures

Figures reproduced from arXiv: 2605.25842 by Aritra Dutta, Somak Aditya.

**Figure 1.** Overview of µCRASP. Gradient attribution and trajectory-pivot attribution provide complementary pruning signals, which are fused using a pruning-ratio-dependent coefficient γ_dyn. Per-layer cross-modal dependency (CMDS) and output sensitivity define a protection factor pf that preserves vision-language integration layers under aggressive pruning. The resulting importance scores are used by a greedy knapsack…

**Figure 2.** Performance across pruning methods under increasing pruning level on Qwen2.5-VL-7B.

**Figure 3.** (a) MLP neuron retention after 40% pruning (bars) overlaid with per-layer MLP CMDS (red line). Layers in the peak CMDS region (11–18) retain substantially more neurons. (b) Sliding window ablation on Qwen2.5-VL-7B Physical: zeroing out four contiguous layers at each position. Ablating low-CMDS layers (4–7) causes only moderate degradation, whereas ablating high-CMDS layers (12–15, 16–19) triggers catastr…

**Figure 4.** Per-token KL divergence distributions (D_KL(dense ∥ pruned)) on Qwen2.5-VL-7B at 45% pruning across all three reasoning domains. Dashed vertical lines mark per-method means. µCRASP concentrates KL mass near zero (mean 0.92 nats) in every domain, while baselines exhibit broader and right-shifted distributions. The near-identical distributional structure across Physical, Commonsense, and Quantity confirms that µ…

**Figure 5.** Compression–performance curves for LLAMA 3.2 11B across three reasoning domains.

**Figure 6.** Compression–performance curves for Qwen2.5-VL-7B across three reasoning domains.

**Figure 7.** Compression–performance curves for Gemma-3-4B across three reasoning domains.

**Figure 8.** Compression–performance curves for Qwen2.5-VL-3B across three reasoning domains.

**Figure 9.** Compression–performance curves for Qwen2-VL-2B across three reasoning domains.

**Figure 10.** Qualitative comparison of reasoning traces under structured pruning on player colour

**Figure 11.** Qualitative comparison of reasoning traces under structured pruning on urban scene

**Figure 12.** Qualitative comparison of reasoning traces under structured pruning.

**Figure 13.** Qualitative comparison of reasoning traces under structured pruning on counting-based

**Figure 14.** Qualitative comparison of reasoning traces under structured pruning on object counting

**Figure 15.** Qualitative comparison of reasoning traces under structured pruning on vehicle identifica…

**Figure 16.** Qualitative comparison of reasoning traces under structured pruning on fruit counting

**Figure 17.** Qualitative comparison of reasoning traces under structured pruning on colour identifica…

**Figure 18.** Qualitative comparison of reasoning traces under structured pruning on traffic-sign

**Figure 19.** Qualitative comparison of reasoning traces under structured pruning on contextual mascot
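The per-token KL divergence reported in the Figure 4 caption can be computed along the following lines. This is a minimal sketch, assuming access to next-token logits from the dense and pruned models at the same positions; the function names and the toy logits are illustrative and not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_nats(dense_logits, pruned_logits):
    """D_KL(dense || pruned) for one token position, in nats."""
    lp = log_softmax(dense_logits)
    lq = log_softmax(pruned_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def per_token_kl(dense_seq, pruned_seq):
    """One KL value per generated token; both inputs are lists of logit vectors."""
    if len(dense_seq) != len(pruned_seq):
        raise ValueError("sequences must cover the same token positions")
    return [kl_nats(d, p) for d, p in zip(dense_seq, pruned_seq)]


# Toy logits over a three-token vocabulary at two positions (made-up values).
dense = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
pruned = [[1.8, 0.7, -0.9], [2.5, 0.2, 0.3]]
kls = per_token_kl(dense, pruned)
print(kls, sum(kls) / len(kls))
```

A distribution of these values concentrated near zero indicates that the pruned model's next-token predictions stay close to the dense model's. A heavy right tail points to positions where the two diverge.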

Original abstract

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.
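For readers unfamiliar with the perplexity-degradation measure mentioned in the abstract, the following sketch shows one common way to compute it from per-token log-probabilities. The relative-increase definition and the function names are assumptions; the abstract does not state which formula the paper uses.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_degradation(dense_logprobs, pruned_logprobs):
    """Fractional perplexity increase of the pruned model over the dense one."""
    dense_ppl = perplexity(dense_logprobs)
    pruned_ppl = perplexity(pruned_logprobs)
    return (pruned_ppl - dense_ppl) / dense_ppl


# Made-up log-probabilities assigned to the same reference tokens.
dense_lp = [math.log(0.5)] * 4
pruned_lp = [math.log(0.25)] * 4
print(perplexity(dense_lp), perplexity(pruned_lp))
print(relative_degradation(dense_lp, pruned_lp))
```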

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MuCRASP targets CoT pivot tokens and cross-modal activations for VLM pruning, but its main evidence rests on unvalidated LLM-as-a-Judge scores.


MuCRASP is a structured pruning framework for vision-language models that tries to protect chain-of-thought reasoning by focusing on sparse pivot tokens in the generation path and by handling activation differences between visual and text modalities under a global budget.

The paper does a solid job spelling out why standard pruning methods, built for unimodal LLMs, miss these factors. It reports results across four VLMs and three reasoning benchmarks, with concrete gains such as the 8.87 versus 7.32 LLM judge score at 30 percent pruning on Qwen2.5-VL-7B for physical reasoning, plus maintained consistency out to 50 percent pruning and lower perplexity degradation.

The clearest weakness is the evaluation. The central claim that reasoning quality is preserved rests almost entirely on the LLM-as-a-Judge metric, yet the abstract supplies no correlation data with human judgments, exact-match accuracy, or inter-judge agreement. If the judge model responds to token-distribution shifts induced by the pruning, the numerical gap does not necessarily show true CoT preservation. Details on how pivot tokens are identified and on layer-wise sensitivity calculations are also thin, which makes it harder to judge reproducibility.

This is aimed at researchers working on efficient deployment of multimodal models where reasoning chains are important. Readers who need practical compression ideas for VLMs will find the targeted framework worth examining, even if they have to fill in the evaluation gaps themselves.

The work shows clear engagement with a real limitation in prior pruning approaches, so it deserves a serious referee. I would send it out for review, with the expectation that referees will press on metric validation and ablations around the pivot-token mechanism.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MuCRASP, a structured pruning framework for vision-language models that incorporates awareness of chain-of-thought (CoT) reasoning. It identifies two failure modes in prior pruning methods—ignorance of sparse pivot tokens critical for CoT consistency and failure to handle cross-modal activation differences—and designs pruning to target reasoning-critical components while preserving cross-modal alignment under a global parameter budget with layer-wise sensitivity. Experiments across four VLMs and three reasoning benchmarks claim that MuCRASP preserves reasoning quality better than baselines under compression, including an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline at 30% pruning on Qwen2.5-VL-7B physical reasoning tasks, with maintained consistency up to 50% pruning and lower perplexity degradation.

Significance. If the results hold under rigorous evaluation, the work addresses a practical barrier to deploying reasoning-capable VLMs by offering a pruning approach that better maintains CoT performance than existing methods, potentially enabling more efficient multimodal systems.

major comments (2)

[Abstract] The sole quantitative evidence for the central claims that MuCRASP 'preserves reasoning quality' and is 'significantly outperforming prior pruning approaches' is the LLM-as-a-Judge score (8.87 vs. 7.32 at 30% pruning). No correlation with human judgments, exact-match task accuracy, or inter-judge agreement is reported, so it is unclear whether the gap demonstrates true CoT preservation or merely stylistic preference by the judge model.
[Abstract] The motivation rests on the assertions that (1) CoT consistency depends on sparse pivot tokens and (2) unimodal pruning methods ignore visual/textual activation differences. These are presented as the reasons existing methods fail, yet the abstract supplies no supporting analysis, ablation, or measurement showing these factors are load-bearing for the observed performance gaps.
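The validation the first comment asks for could take the form of a rank correlation between judge scores and human ratings on the same reasoning traces. The sketch below is one way to compute Spearman's coefficient with the standard library only. The scores are placeholder values invented for illustration, not data from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for five traces: LLM judge (0-10) vs. human rating (1-5).
judge = [8.5, 7.0, 9.0, 6.0, 7.5]
human = [4, 3, 5, 2, 4]
print(spearman(judge, human))
```

A high coefficient on a held-out sample would support using the judge as a proxy. A low one would suggest the judge is responding to something other than reasoning quality. The same function applied to two different judge models gives a simple inter-judge agreement check.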

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will make revisions to strengthen the presentation of evidence and motivation while remaining faithful to the manuscript's content.


Referee: [Abstract] The sole quantitative evidence for the central claims that MuCRASP 'preserves reasoning quality' and is 'significantly outperforming prior pruning approaches' is the LLM-as-a-Judge score (8.87 vs. 7.32 at 30% pruning). No correlation with human judgments, exact-match task accuracy, or inter-judge agreement is reported, so it is unclear whether the gap demonstrates true CoT preservation or merely stylistic preference by the judge model.

Authors: We acknowledge the referee's point that the abstract centers the LLM-as-a-Judge score for the reasoning preservation claim. The manuscript does report supplementary metrics including lower perplexity degradation and maintained reasoning consistency up to 50% pruning. However, we agree that the absence of reported human correlation, exact-match accuracy, or inter-judge agreement leaves the interpretation of the judge score open to the concern raised. We will revise the abstract to reference these additional metrics explicitly and add a brief discussion in the experiments section on the LLM judge's use as a proxy, noting the lack of human validation as a limitation to be addressed in future work.

Revision: yes

Referee: [Abstract] The motivation rests on the assertions that (1) CoT consistency depends on sparse pivot tokens and (2) unimodal pruning methods ignore visual/textual activation differences. These are presented as the reasons existing methods fail, yet the abstract supplies no supporting analysis, ablation, or measurement showing these factors are load-bearing for the observed performance gaps.

Authors: The abstract's brevity limits detailed support, but the full manuscript provides empirical analysis of activation patterns identifying pivot tokens and cross-modal differences in Sections 3 and 4. We agree the abstract would benefit from tighter linkage. We will revise the abstract to include a concise phrase referencing these observations from our preliminary analysis, thereby better grounding the motivation without altering the manuscript's core claims.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark comparisons without self-referential derivations or fitted predictions.

full rationale

The paper proposes MuCRASP as a structured pruning method motivated by two observations about CoT pivot tokens and cross-modal activations, then reports experimental results on VLMs and benchmarks using LLM-as-a-Judge scores. No equations, derivations, or parameter-fitting steps appear in the abstract or described content that reduce the performance claims to inputs by construction. Claims of superiority are supported by direct comparisons (e.g., 8.87 vs 7.32 scores) rather than any self-definitional, self-citation load-bearing, or ansatz-smuggling mechanisms. The derivation chain is self-contained as an empirical framework without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities is provided in the source material.


Reference graph

Works this paper leans on

101 extracted references · 17 canonical work pages · 11 internal anchors

[1]

Explicit reasoning over end-to-end neural architectures for visual question answering, 2018

Somak Aditya, Yezhou Yang, and Chitta Baral. Explicit reasoning over end-to-end neural architectures for visual question answering, 2018. URL https://arxiv.org/abs/1803. 08896

2018
[2]

Spatial knowledge distillation to aid visual reasoning

Somak Aditya, Rudra Saha, Yezhou Yang, and Chitta Baral. Spatial knowledge distillation to aid visual reasoning. In2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 227–235. IEEE, 2019

2019
[3]

GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yinfei Yang, Siddhartha Jayanti, and Santiago Ontañón. GQA: Training generalized multi-query transformer models from multi-head checkpoints. InEMNLP, 2023

2023
[4]

Fluctuation-based adaptive structured pruning for large language models

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large language models. InAAAI, 2024

2024
[5]

Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. In ICLR, 2024

2024
[6]

Humans or llms as the judge? a study on judgement biases, 2024

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases, 2024. URL https://arxiv.org/abs/2402. 10669

2024
[7]

Mint-cot: Enabling interleaved visual tokens in mathematical chain-of-thought reasoning

Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, and Hongsheng Li. Mint-cot: Enabling interleaved visual tokens in mathematical chain-of-thought reasoning.arXiv preprint arXiv:2506.05331, 2025

work page arXiv 2025
[8]

Nlki: A lightweight natural language knowledge integration framework for improving small vlms in commonsense vqa tasks

Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, and Somak Aditya. Nlki: A lightweight natural language knowledge integration framework for improving small vlms in commonsense vqa tasks. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 10543–10563, 2025

2025
[9]

SparseGPT: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. InICML, 2023

2023
[10]

Gemma 3: Open models for responsible AI

Gemma Team, Google DeepMind. Gemma 3: Open models for responsible AI. Technical report, 2025

2025
[11]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J

Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel method for the two-sample problem. InNeurIPS, 2006

2006
[13]

Borgwardt, Malte J

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.Journal of Machine Learning Research, 13:723–773, 2012

2012
[14]

Song Han, Jeff Pool, John Tung, and William J. Dally. Learning both weights and connections for efficient neural networks. InNeurIPS, 2015

2015
[15]

Stork, and Gregory J

Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon and general network pruning. InIEEE International Conference on Neural Networks, 1993

1993
[16]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[17]

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperform- ing larger language models with less training data and smaller model sizes.arXiv preprint arXiv:2305.02301, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

IVTP: Instruction-guided visual token pruning for large vision-language models

Kai Huang, Hao Zou, Ye Xi, BoChen Wang, Zhen Xie, and Liang Yu. IVTP: Instruction-guided visual token pruning for large vision-language models. InECCV, 2024

2024
[19]

An analysis of visual question answering algorithms

Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In ICCV, 2017

2017
[20]

Shortened llama: Depth pruning for large language models with comparison of retraining methods, 2024

Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened llama: Depth pruning for large language models with comparison of retraining methods, 2024. URLhttps://arxiv.org/abs/2402.02834

work page arXiv 2024
[21]

TAMP: Token-adaptive layerwise pruning in multimodal large language models.arXiv preprint arXiv:2504.09897, 2025

Minsu Kim, Jiwan Chung, et al. TAMP: Token-adaptive layerwise pruning in multimodal large language models.arXiv preprint arXiv:2504.09897, 2025

work page arXiv 2025
[22]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InNeurIPS, 2022

2022
[23]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, 12 Timot...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Denker, and Sara A

Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. InNeurIPS, 1989

1989
[25]

Layer-adaptive sparsity for the magnitude-based pruning.arXiv preprint arXiv:2010.07611, 2021

Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, and Jinwoo Shin. Layer-adaptive sparsity for the magnitude-based pruning.arXiv preprint arXiv:2010.07611, 2021

work page arXiv 2010
[26]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, et al. LLaV A-OneVision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, et al. Learn to explain: Multimodal reasoning via thought chains for science question answering. InNeurIPS, 2022

2022
[28]

Faithful chain-of-thought reasoning

Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apid- ianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi, editors,Proceedings of the 13th International Joint Conference on Natural Language Pro- cessin...
[29]

doi: 10.18653/v1/2023.ijcnlp-main.20

Association for Computational Linguistics. doi: 10.18653/v1/2023.ijcnlp-main.20. URL https://aclanthology.org/2023.ijcnlp-main.20/

work page doi:10.18653/v1/2023.ijcnlp-main.20 2023
[30]

LLM-Pruner: On the structural pruning of large language models

Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-Pruner: On the structural pruning of large language models. InNeurIPS, 2023

2023
[31]

LLM-Pruner: Support for GQA and Llama-3

Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-Pruner: Support for GQA and Llama-3. https://github.com/horseee/LLM-Pruner, 2024

2024
[32]

OK-VQA: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. InCVPR, 2019

2019
[33]

Knapsack problems: Algorithms and computer implementa- tions.Wiley, 1990

Silvano Martello and Paolo Toth. Knapsack problems: Algorithms and computer implementa- tions.Wiley, 1990

1990
[34]

Are sixteen heads really better than one? In NeurIPS, 2019

Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, 2019

2019
[35]

Pruning convolutional neural networks for resource efficient inference

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. InICLR, 2017

2017
[36]

Importance estimation for neural network pruning

Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. InCVPR, 2019

2019
[37]

Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution

Qwen Team. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. Technical report, 2024

2024
[38]

Qwen2.5-VL: Technical report

Qwen Team. Qwen2.5-VL: Technical report. Technical report, 2025

2025
[39]

A-OKVQA: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022

2022
[40]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

2024
[41]

GLU Variants Improve Transformer

Noam Shazeer. Glu variants improve transformer, 2020. URL https://arxiv.org/abs/ 2002.05202

work page internal anchor Pith review Pith/arXiv arXiv 2020
[42]

Zico Kolter

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. InICLR, 2024. 13

2024
[43]

ECoFLaP: Efficient coarse-to-fine layer-wise pruning for vision-language models

Yi-Lin Sung, Jaehong Yoon, and Mohit Bansal. ECoFLaP: Efficient coarse-to-fine layer-wise pruning for vision-language models. InICLR, 2024

2024
[44]

The information bottleneck method

Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000

work page internal anchor Pith review Pith/arXiv arXiv 2000
[45]

Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned

Elena V oita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. InACL, 2019

2019
[46]

Exploring intrinsic dimension for vision- language model pruning

Hanzhang Wang, Jiawen Zhang, and Qingyuan Ma. Exploring intrinsic dimension for vision- language model pruning. InICML, 2024

2024
[47]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[48]

Task-specific compression for multi-task language models using attribution-based pruning

Nakyeong Yang, Yunah Jang, Hwanhee Lee, Seohyeong Jeong, and Kyomin Jung. Task-specific compression for multi-task language models using attribution-based pruning. InFindings of EACL, 2023

2023
[49]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

ATP-LLaV A: Adaptive token pruning for large vision language models

Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. ATP-LLaV A: Adaptive token pruning for large vision language models. InarXiv preprint arXiv:2412.00447, 2024

work page arXiv 2024
[51]

Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity

Lu Yin, You Wu, Zhenyu Zhang, et al. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity. InICML, 2024

2024
[52]

Chain of preference optimization: Improving chain-of-thought reasoning in LLMs

Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. Chain of preference optimization: Improving chain-of-thought reasoning in LLMs. InNeurIPS, 2024

2024
[53]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Mul- timodal chain-of-thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Multimodal chain-of-thought reasoning in language models

Zhuosheng Zhang, Aston Zhang, Mu Li, et al. Multimodal chain-of-thought reasoning in language models. InACL, 2023

2023
[55]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/ abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

A Ablation studies

A.1 µCRASP: Is each component necessary?

We systematically ablate each component of µCRASP on Qwen2.5-VL-7B at 50% pruning, an aggressive regime that exposes the contribution of each me...
Answer (if the answer is incorrect, ignore it and generate the correct answer based on the image and question)

Please generate a reasoning chain with 2–3 concise steps where needed, referencing specific visual cues (“I observe...”, “Because I see...”), then provide the final answer.

Format:
Step-by-Step Thought:
Therefore ... deduced_answer: {final_answer}

Constraints:
• Since this is a quantitative reasoning question, write numbers in words (e.g., “one”, “two”)

We generate CoT traces using greedy decoding (temperature = 0) to ensure reproducibility. To ensure annotation quality, all generated CoT traces were manually reviewed by two human annotators, including o...

• A question about the image
• A generated chain-of-thought (CoT) reasoning trace
• A final predicted answer

Your task is to evaluate the quality of the reasoning trace along two dimensions:
• Grounding Score (0–10): Evaluate how well the reasoning is supported by observable visual evidence in the image. High scores indicate that the reasoning correctly references visible objects, attributes, spatial relations, and scene details.
• Hall...

1. I observe the player in the foreground wearing a white shirt with the number “15” on it.
2. The player appears to be leading the action because they are closest to the ball and actively engaged in play.
3. Therefore, the color of the lead player’s shirt is white. Final Answer: white.

µCRASP (Ours): Step-by-Step Thought:
1. Identify the lead player in the scene.
2. The lead player appears to be the individual closest to the ball.
3. The player is wearing a white shirt. Final Answer: The color of the lead player’s shirt is white.

FLAP: The generated response becomes fragmented and semantically incoherent, failing to correctly identify the player’s shirt color. ...[truncated]

LLM-Pruner: The image shows a football player in action, likely within a soccer match scenario. However, the ge...

1. I observe the text “New York City Bus” displayed on the rear of the bus, indicating that the bus operates in New York City.
2. The architecture of the surrounding buildings and the design of the bus are also consistent with a typical New York City street environment.
3. Therefore, the most likely setting for this street is New York City. Final Answer: New York.

µCRASP (Ours): The image contains a bus with signage indicating “New York City Bus,” which strongly suggests that the scene takes place in New York City. The surrounding urban architecture and overall street layout are also visually consistent with a dense metrop...

1. I observe a bright yellow-green object in the air above the person, which appears to be a frisbee.
2. The frisbee’s colour is distinctly greenish-yellow, matching the typical appearance of a frisbee.
3. Therefore, the colour of the Frisbee is green. Final Answer: green.

µCRASP (Ours): The frisbee in the image appears to be yellow with some green markings, which is typical of a standard frisbee used in beach settings or similar environments. Final colour: Yellow with green markings.

FLAP: The light brown object being held suggests the person’s clothing, w...

1. I observe four individuals seated in a ski lift chair.
2. Each person is wearing winter clothing and appears to be enjoying the ride.
3. Therefore, the total number of people in the picture is four. Final Answer: four.

µCRASP (Ours): There are four people in the picture. The image shows a group of individuals sitting together on a ski lift, which is a common mode of transport used to carry skiers or snow riders. Each person appears to be wearing winter or snow gear, and they are all facing...

1. I observe three forks in the image: one placed on the left side of the pizza plate, another on the right side of the pizza plate, and a third one near the bottom right corner of the image.
2. I infer that these are all distinct forks visible in the photo, as they are positioned separately and there are no overlapping or hidden forks.
3. Therefore, the total count of forks in the photo is three. Final Answer: three.

µCRASP (Ours): There are three forks visible in the photo. One fork is placed on the plate to the left side of the pizza, another fork is on the top right plate near the edge of the table, and a third fork is partially visible on the bottom right plate, extending from the edge...

1. I observe a man sitting on a red vehicle with large wheels and handlebars, which is characteristic of an all-terrain vehicle (ATV)

