arxiv: 2605.25842 · v2 · pith:5NRPO766new · submitted 2026-05-25 · 💻 cs.AI · cs.CL

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

Pith reviewed 2026-06-29 21:10 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords structured pruning · chain-of-thought reasoning · vision-language models · model compression · multimodal reasoning · parameter pruning · reasoning preservation

The pith

MuCRASP prunes vision-language models by targeting CoT pivot tokens and modality-specific activations to preserve reasoning accuracy under compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard structured pruning removes parameters without regard for the sparse transition points that sustain chain-of-thought consistency in multimodal generation. It further claims that visual and textual activations follow different distributions, so pruning rules derived from text-only models damage cross-modal alignment. MuCRASP therefore selects which components to remove by combining reasoning-critical identification, cross-modal preservation, and layer-wise sensitivity within one global parameter budget. If this selection rule works, VLMs could be made substantially smaller while still solving physical-reasoning and other step-by-step multimodal tasks at close to their original quality.

Core claim

MuCRASP is a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget; experiments on four VLMs across three reasoning benchmarks show it consistently preserves reasoning quality under increasing compression, maintains high reasoning consistency up to 50 percent pruning, and exhibits lower perplexity degradation than prior methods.

What carries the argument

MuCRASP, the structured pruning framework that identifies and protects reasoning-critical components by locating pivot tokens in CoT trajectories and respecting activation-distribution differences between visual and textual modalities under a single global budget.
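The idea of selecting components under one global budget can be illustrated with a minimal sketch. The Figure 1 caption mentions a greedy knapsack step and a per-layer protection factor, but this page does not give the exact scoring rule. Everything below is therefore an assumption made for illustration: the `Unit` record, the multiplicative use of the protection factor, and the importance-per-parameter ordering are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str          # hypothetical identifier for a prunable neuron or head
    layer: int         # index of the layer the unit belongs to
    importance: float  # fused importance score, assumed to be precomputed
    cost: int          # parameters freed if this unit is removed


def greedy_prune(units, protection, budget):
    """Greedily pick units to remove until `budget` parameters are freed.

    `protection` maps a layer index to a factor >= 1 that inflates the
    importance of units in that layer (assumed form of a protection factor).
    Units are visited from lowest to highest protected importance per parameter.
    """
    def density(u):
        return u.importance * protection.get(u.layer, 1.0) / u.cost

    pruned, removed = [], 0
    for u in sorted(units, key=density):
        if removed + u.cost <= budget:
            pruned.append(u.name)
            removed += u.cost
    return pruned, removed


# Toy data, invented for this example only.
units = [
    Unit("l4.n0", 4, 0.2, 100),
    Unit("l4.n1", 4, 0.5, 100),
    Unit("l12.n0", 12, 0.3, 100),
    Unit("l12.n1", 12, 0.1, 100),
]

# Layer 12 is treated as a protected vision-language integration layer.
pruned, removed = greedy_prune(units, protection={12: 6.0}, budget=200)
unprotected, _ = greedy_prune(units, protection={}, budget=200)
print(pruned, removed)
print(unprotected)
```

In this toy run the protection factor moves pruning away from layer 12: without it the lowest-scoring layer-12 unit is removed first, with it both removals come from layer 4. A single sorted pass is the simplest greedy variant; a real system would also need to respect structural constraints such as grouped attention heads.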

If this is right

  • Reasoning consistency remains high even after 50 percent of parameters are removed.
  • Perplexity rises more slowly than under CoT-agnostic pruning.
  • The same global budget can be allocated across layers according to measured sensitivity rather than uniform ratios.
  • Cross-modal alignment is retained because pruning decisions explicitly consider both visual and textual activation statistics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pivot-token logic could be applied to prune models that generate long chains of tool calls or multi-step plans in non-visual domains.
  • If pivot tokens prove stable across tasks, a one-time identification pass might suffice for repeated pruning at different compression targets.
  • Layer-wise sensitivity maps computed once on a calibration set could guide pruning for families of related VLMs without re-running the full search.

Load-bearing premise

CoT consistency in VLMs depends on sparse pivot tokens whose removal by existing pruning methods destroys accuracy, and visual versus textual activations differ enough that unimodal pruning rules cannot handle them.

What would settle it

A side-by-side evaluation on the same physical-reasoning benchmark and the same Qwen2.5-VL-7B model at 30 percent pruning in which MuCRASP does not produce a higher LLM-as-a-Judge score than the strongest baseline.

Figures

Figures reproduced from arXiv: 2605.25842 by Aritra Dutta, Somak Aditya.

Figure 1: Overview of µCRASP. Gradient attribution and trajectory-pivot attribution provide complementary pruning signals, which are fused using a pruning-ratio-dependent coefficient γdyn. Per-layer cross-modal dependency (CMDS) and output sensitivity define a protection factor pf that preserves vision-language integration layers under aggressive pruning. The resulting importance scores are used by a greedy knapsack…
Figure 2: Performance across pruning methods under increasing pruning level on Qwen2.5-VL-7B
Figure 3: (a) MLP neuron retention after 40% pruning (bars) overlaid with per-layer MLP CMDS (red line). Layers in the peak CMDS region (11-18) retain substantially more neurons. (b) Sliding window ablation on Qwen2.5-VL-7B Physical: zeroing out four contiguous layers at each position. Ablating low-CMDS layers (4-7) causes only moderate degradation, whereas ablating high-CMDS layers (12-15, 16-19) triggers catastr…
Figure 4: Per-token KL divergence distributions (DKL(dense∥pruned)) on Qwen2.5-VL-7B at 45% pruning across all three reasoning domains. Dashed vertical lines mark per-method means. µCRASP concentrates KL mass near zero (mean 0.92 nats) in every domain, while baselines exhibit broader and right-shifted distributions. The near-identical distributional structure across Physical, Commonsense, and Quantity confirms that µ…
Figure 5: Compression–performance curves for LLAMA 3.2 11B across three reasoning domains.
Figure 6: Compression–performance curves for Qwen2.5-VL-7B across three reasoning domains.
Figure 7: Compression–performance curves for Gemma-3-4B across three reasoning domains.
Figure 8: Compression–performance curves for Qwen2.5-VL-3B across three reasoning domains.
Figure 9: Compression–performance curves for Qwen2-VL-2B across three reasoning domains.
Figure 10: Qualitative comparison of reasoning traces under structured pruning on player colour
Figure 11: Qualitative comparison of reasoning traces under structured pruning on urban scene
Figure 12: Qualitative comparison of reasoning traces under structured pruning.
Figure 13: Qualitative comparison of reasoning traces under structured pruning on counting-based
Figure 14: Qualitative comparison of reasoning traces under structured pruning on object counting
Figure 15: Qualitative comparison of reasoning traces under structured pruning on vehicle identifica
Figure 16: Qualitative comparison of reasoning traces under structured pruning on fruit counting
Figure 17: Qualitative comparison of reasoning traces under structured pruning on colour identifica
Figure 18: Qualitative comparison of reasoning traces under structured pruning on traffic-sign
Figure 19: Qualitative comparison of reasoning traces under structured pruning on contextual mascot
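
The per-token quantity plotted in Figure 4, DKL(dense∥pruned), can be computed from the two models' next-token logits at each position. The sketch below uses only the standard library. The function names and the toy logits are hypothetical, and it assumes both models share a vocabulary and were run on the same token sequence.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_kl(dense_logits, pruned_logits):
    """Return D_KL(dense || pruned) in nats for every token position.

    Both arguments are lists of rows, one row of logits per position.
    """
    kls = []
    for dense_row, pruned_row in zip(dense_logits, pruned_logits):
        log_p = log_softmax(dense_row)   # dense model distribution
        log_q = log_softmax(pruned_row)  # pruned model distribution
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls


# Toy logits over a three-word vocabulary, invented for illustration.
dense = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
pruned = [[2.0, 0.0, -1.0], [1.5, 0.5, -0.5]]

kls = per_token_kl(dense, pruned)
mean_kl = sum(kls) / len(kls)
print(kls, mean_kl)
```

The first position has identical logits, so its divergence is zero; the second position diverges because the pruned model is no longer uniform. Averaging `kls` over all positions gives a per-method mean of the kind marked by the dashed lines in the figure.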
Original abstract

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MuCRASP, a structured pruning framework for vision-language models that incorporates awareness of chain-of-thought (CoT) reasoning. It identifies two failure modes in prior pruning methods—ignorance of sparse pivot tokens critical for CoT consistency and failure to handle cross-modal activation differences—and designs pruning to target reasoning-critical components while preserving cross-modal alignment under a global parameter budget with layer-wise sensitivity. Experiments across four VLMs and three reasoning benchmarks claim that MuCRASP preserves reasoning quality better than baselines under compression, including an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline at 30% pruning on Qwen2.5-VL-7B physical reasoning tasks, with maintained consistency up to 50% pruning and lower perplexity degradation.

Significance. If the results hold under rigorous evaluation, the work addresses a practical barrier to deploying reasoning-capable VLMs by offering a pruning approach that better maintains CoT performance than existing methods, potentially enabling more efficient multimodal systems.

major comments (2)
  1. [Abstract] The sole quantitative evidence for the central claim that MuCRASP 'preserves reasoning quality' and 'significantly outperforming prior pruning approaches' is the LLM-as-a-Judge score (8.87 vs. 7.32 at 30% pruning). No correlation with human judgments, exact-match task accuracy, or inter-judge agreement is reported, so it is unclear whether the gap demonstrates true CoT preservation or merely stylistic preference by the judge model.
  2. [Abstract] The motivation rests on the assertions that (1) CoT consistency depends on sparse pivot tokens and (2) unimodal pruning methods ignore visual/textual activation differences. These are presented as the reasons existing methods fail, yet the abstract supplies no supporting analysis, ablation, or measurement showing these factors are load-bearing for the observed performance gaps.
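
The first comment asks for a correlation between judge scores and human judgments. One common way to report that is Spearman's rank correlation, sketched below in plain Python. The scores are invented placeholders, not data from the paper, and the choice of Spearman over other agreement statistics is an assumption of this example.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five reasoning traces.
judge_scores = [8.5, 7.0, 9.0, 6.0, 7.5]  # LLM judge, 0-10 scale
human_scores = [4, 3, 5, 2, 4]            # human rating, 1-5 scale

rho = spearman(judge_scores, human_scores)
print(rho)
```

A value of rho near 1 would indicate that the judge orders traces much as human raters do; a low value would support the referee's concern that the reported gap reflects judge preference rather than reasoning quality.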

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will make revisions to strengthen the presentation of evidence and motivation while remaining faithful to the manuscript's content.

Point-by-point responses
  1. Referee: [Abstract] The sole quantitative evidence for the central claim that MuCRASP 'preserves reasoning quality' and 'significantly outperforming prior pruning approaches' is the LLM-as-a-Judge score (8.87 vs. 7.32 at 30% pruning). No correlation with human judgments, exact-match task accuracy, or inter-judge agreement is reported, so it is unclear whether the gap demonstrates true CoT preservation or merely stylistic preference by the judge model.

    Authors: We acknowledge the referee's point that the abstract centers the LLM-as-a-Judge score for the reasoning preservation claim. The manuscript does report supplementary metrics including lower perplexity degradation and maintained reasoning consistency up to 50% pruning. However, we agree that the absence of reported human correlation, exact-match accuracy, or inter-judge agreement leaves the interpretation of the judge score open to the concern raised. We will revise the abstract to reference these additional metrics explicitly and add a brief discussion in the experiments section on the LLM judge's use as a proxy, noting the lack of human validation as a limitation to be addressed in future work. revision: yes

  2. Referee: [Abstract] The motivation rests on the assertions that (1) CoT consistency depends on sparse pivot tokens and (2) unimodal pruning methods ignore visual/textual activation differences. These are presented as the reasons existing methods fail, yet the abstract supplies no supporting analysis, ablation, or measurement showing these factors are load-bearing for the observed performance gaps.

    Authors: The abstract's brevity limits detailed support, but the full manuscript provides empirical analysis of activation patterns identifying pivot tokens and cross-modal differences in Sections 3 and 4. We agree the abstract would benefit from tighter linkage. We will revise the abstract to include a concise phrase referencing these observations from our preliminary analysis, thereby better grounding the motivation without altering the manuscript's core claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark comparisons without self-referential derivations or fitted predictions.

full rationale

The paper proposes MuCRASP as a structured pruning method motivated by two observations about CoT pivot tokens and cross-modal activations, then reports experimental results on VLMs and benchmarks using LLM-as-a-Judge scores. No equations, derivations, or parameter-fitting steps appear in the abstract or described content that reduce the performance claims to inputs by construction. Claims of superiority are supported by direct comparisons (e.g., 8.87 vs 7.32 scores) rather than any self-definitional, self-citation load-bearing, or ansatz-smuggling mechanisms. The derivation chain is self-contained as an empirical framework without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities is provided in the source material.

pith-pipeline@v0.9.1-grok · 5758 in / 1310 out tokens · 42052 ms · 2026-06-29T21:10:25.131363+00:00 · methodology
