arxiv: 2506.01097 · v2 · submitted 2025-06-01 · 💻 cs.CV

Task-Related Token Compression in Multimodal Large Language Models from an Explainability Perspective

Pith reviewed 2026-05-19 10:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · visual token compression · explainability · attention maps · task-related pruning · inference efficiency · model-agnostic design

The pith

Task-related visual token compression works at the input stage of multimodal LLMs when guided by explainability scores, with negligible performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that visual tokens can be pruned for relevance to the user instruction right at the start of LLM processing rather than in intermediate layers. Explainability techniques measure each token's global importance to the task, and a lightweight convolutional network approximates those scores from the first layer's attention map alone. This early selection removes many task-irrelevant tokens while preserving model output quality. The approach requires no changes to the underlying LLM and trains independently of it. Tests on thirteen image and video benchmarks with three different MLLMs show reduced computation, shorter prefilling times, and smaller memory use during inference.

Core claim

Explainability methods for transformer architectures can evaluate the global importance of each visual token with respect to a given instruction; this importance can be learned as a mapping from the first LLM layer's attention map by a simple lightweight convolutional network, enabling effective task-related token compression directly at the LLM input stage without full inference or any modification to the model architecture.
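
The selection step implied by this claim can be sketched as follows. This is a minimal illustration, assuming per-token importance scores are already available; the function name `select_visual_tokens`, the `keep_ratio` parameter, and the toy scores are hypothetical and are not taken from the paper.

```python
import math

def select_visual_tokens(scores, keep_ratio):
    """Return indices of the highest-scoring visual tokens, in original order.

    scores: one importance score per visual token (assumed to be given).
    keep_ratio: fraction of visual tokens to keep, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank by score (descending); ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore positional order so the kept tokens stay in sequence order.
    return sorted(ranked[:k])

# Toy scores for eight visual tokens (illustrative values only).
toy_scores = [0.05, 0.90, 0.10, 0.70, 0.02, 0.40, 0.01, 0.65]
kept = select_visual_tokens(toy_scores, keep_ratio=0.5)
print(kept)  # [1, 3, 5, 7]
```

Only the kept indices would be passed on to the LLM, which is why no change to the model itself is needed in this sketch.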

What carries the argument

Lightweight convolutional network that maps first-layer attention maps to global token importance scores obtained from explainability methods.
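
To make the idea of a convolution over an attention map concrete, here is a deliberately tiny sketch. It is not the paper's network: the architecture is not described on this page, so the single filter, the hand-set weights, the sigmoid output, and the layout of `attn` (rows are instruction tokens, columns are visual tokens) are all assumptions. A real predictor would be trained to regress the explainability scores.

```python
import math

def score_visual_tokens(attn, kernel, bias=0.0):
    """Slide one (rows x width) filter along the visual-token axis of `attn`.

    attn: list of rows, one per instruction token, each holding attention
          weights over the visual tokens (assumed layout).
    kernel: filter weights with one row per attention-map row and an odd width.
    Returns one score in (0, 1) per visual token.
    """
    n_rows, n_vis = len(attn), len(attn[0])
    if len(kernel) != n_rows:
        raise ValueError("kernel needs one row per attention-map row")
    width = len(kernel[0])
    pad = width // 2
    scores = []
    for j in range(n_vis):
        acc = bias
        for r in range(n_rows):
            for w in range(width):
                col = j + w - pad
                if 0 <= col < n_vis:  # zero padding at the borders
                    acc += kernel[r][w] * attn[r][col]
        # Sigmoid squashes the filter response into an importance score.
        scores.append(1.0 / (1.0 + math.exp(-acc)))
    return scores

# Toy 2 x 5 attention map and an untrained, hand-set filter (illustrative only).
attn = [
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.10, 0.10, 0.10],
]
kernel = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
scores = score_visual_tokens(attn, kernel, bias=-0.5)
print([round(s, 3) for s in scores])
```

The output has one score per visual token, so it can feed a ranking or thresholding step directly.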

If this is right

  • Fewer visual tokens reach the LLM, directly lowering computational costs during both training and inference.
  • Prefilling time decreases and KV cache memory usage shrinks because irrelevant tokens are removed early.
  • The method applies to existing MLLMs without any architecture changes or retraining of the core model.
  • Performance holds across image and video tasks on multiple leading MLLMs, showing broad applicability.
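
The KV-cache consequence can be illustrated with simple arithmetic. The sketch below uses the standard size estimate for a decoder-only transformer (one key and one value vector per token, layer, and KV head); the layer count, head count, head dimension, and token counts are hypothetical placeholders, not figures from the paper or from any specific MLLM.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size: keys and values for every token, layer, and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical model configuration (16-bit values).
config = dict(num_layers=28, num_kv_heads=4, head_dim=128)

# 1000 visual tokens plus 50 text tokens, versus 500 visual tokens after pruning.
full = kv_cache_bytes(num_tokens=1000 + 50, **config)
pruned = kv_cache_bytes(num_tokens=500 + 50, **config)

print(f"full:   {full / 2**20:.1f} MiB")
print(f"pruned: {pruned / 2**20:.1f} MiB")
print(f"ratio:  {pruned / full:.3f}")
```

Because the estimate is linear in the token count, the saving tracks the share of tokens removed; text tokens are unaffected, so the overall ratio is slightly above the visual keep ratio.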

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-attention mapping idea could be tested on audio or text tokens in other multimodal settings.
  • Hardware-constrained deployments might gain larger effective context windows by adopting this pruning step.
  • Combining the approach with later-stage optimizations like quantization could compound efficiency gains.

Load-bearing premise

Explainability methods can accurately assess the global importance of individual visual tokens relative to an instruction, and this importance is reliably predictable from first-layer attention maps alone.

What would settle it

Substantial drops in accuracy on standard image or video benchmarks when the compressed tokens are used, or large mismatches between the lightweight network outputs and full explainability importance scores.
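
One way to quantify the second kind of evidence is a top-k agreement score between the lightweight network's outputs and the explainability scores. The metric choice, the function names, and the toy values below are assumptions made for illustration; the paper may measure agreement differently.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores; ties go to the lower index."""
    return set(sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k])

def top_k_overlap(predicted, reference, k):
    """Fraction of the reference top-k tokens that also appear in the predicted top-k."""
    return len(top_k_indices(predicted, k) & top_k_indices(reference, k)) / k

# Toy scores for six visual tokens (illustrative values only).
reference = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05]   # stand-in for explainability scores
predicted = [0.8, 0.3, 0.2, 0.1, 0.9, 0.6]    # stand-in for network outputs

overlap = top_k_overlap(predicted, reference, k=3)
print(round(overlap, 3))  # 0.667
```

An overlap near 1.0 at the operating keep ratio would support the premise; a low overlap would be the kind of mismatch described above.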

Figures

Figures reproduced from arXiv: 2506.01097 by Chu Tang, Jie Gu, Jingmin Chen, Lei Lei, Tong Xu, Xiaokang Ma.

Figure 1
Figure 1: Overview of our method. The top portion illustrates the details of our explainability-based compression approach: an explainability method can reveal the important visual tokens (first row, Section 3.2); a lightweight model can then be trained to approximate this explainability and serve as a compression indicator (second row, Section 3.3). The bottom portion shows a general inference framework for MLLMs,…
Figure 2
Figure 2: Visualization of R_v obtained via the explainability method (left) and the corresponding token pruning results (right). Based on R_v, the top 50% of visual tokens are retained, while the remaining 50% are pruned (masked in white). All three MLLMs generate the correct answer using only the retained tokens.
… respect to a given instruction, and subsequently prune those that are less essential. Moreover, we inves…
Figure 3
Figure 3: Comparison results on larger images and longer videos. Performance preservation ratio denotes the proportion of the performance retained relative to the Vanilla model. The average retention ratio refers to the mean proportion of retained tokens across all LLM layers. The first two sub-figures illustrate the average performance preservation of MLLMs across image and video benchmarks. The last two sub-figure…
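The two ratios defined in the Figure 3 caption can be computed as in the sketch below. The layer count, the pruning schedules, and the benchmark scores are hypothetical placeholders chosen only to show the arithmetic; none of them come from the paper.

```python
def average_retention_ratio(per_layer_ratios):
    """Mean proportion of visual tokens retained across all LLM layers."""
    return sum(per_layer_ratios) / len(per_layer_ratios)

def performance_preservation(compressed_score, vanilla_score):
    """Proportion of the vanilla model's score kept after compression."""
    return compressed_score / vanilla_score

num_layers = 32  # hypothetical depth

# Input-stage pruning: every layer sees the same 50% of the visual tokens.
input_stage = [0.5] * num_layers
# A hypothetical mid-layer schedule: full tokens for 2 layers, then 50%.
mid_layer = [1.0] * 2 + [0.5] * (num_layers - 2)

print(average_retention_ratio(input_stage))   # 0.5
print(average_retention_ratio(mid_layer))     # 0.53125
print(performance_preservation(49.0, 50.0))   # placeholder scores
```

Averaging over layers puts input-stage and intermediate-layer methods on the same axis, since a method that prunes later carries the full token set through its early layers.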
Figure 4
Figure 4: Video Input Visualizations for LLaVA-OneVision.
Figure 5
Figure 5: Image Input Visualizations for LLaVA-OneVision.
Figure 6
Figure 6: Video Input Visualizations for Qwen2-VL.
Figure 7
Figure 7: Image Input Visualizations for Qwen2-VL.
A.2 Case Study: Explainability Reveals Instruction-Related Visual Tokens
To demonstrate the effectiveness of explainability methods in identifying visual tokens that are highly relevant to user instructions, we present two case studies covering both video and image inputs. Given the same input V, the explainability method generates visual relevance scores R_v that s…
Figure 8
Figure 8: Video Input Visualizations for VILA.
B Details of Data for Training f_θ
We train our explainability-based compressor based on subsets sampled from high-quality open-source datasets. First, the details of the sampling are as follows:
Image Dataset. For training the compressor used in image tasks, we sample a subset of Infinity-MM that ensures high quality and diversity. The training set primarily consists of…
Figure 9
Figure 9: Case Study 1.
Figure 10
Figure 10: Case Study 2.
D Efficiency Analysis in Inference
To evaluate computational efficiency during inference, we follow FastV and PyramidDrop and report the FLOPs of the visual token part. Specifically, we consider the FLOPs of the multihead attention and the feed-forward network (FFN) modules as:
$$\mathrm{FLOPs}_{\mathrm{layer}} = 4nd^{2} + 2n^{2}d + lnm, \tag{4}$$
where n is the number of visual tokens, d is the hidden state size, m is the …
Abstract

Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Instruction-related visual token compression demonstrates strong task relevance, which aligns well with MLLMs' ultimate goal of instruction following. Previous works generally assume that visual tokens achieve better vision-language alignment in the shallow layers of LLMs, which has led to task-related token compression being primarily applied in intermediate LLM layers. In contrast, our study reveals that with proper selection, task-related token compression is feasible at the input stage of the LLM with negligible performance loss. This new paradigm significantly reduces task-irrelevant visual tokens, and its model-agnostic design enables application without modifying the LLM architecture. Specifically, we suggest that explainability methods for transformer-based architechtures can evaluate the global importance of each visual token with respect to the given instruction, which can effectively guide the task-related token compression for MLLMs. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass. Interestingly, this mapping can be learned using a simple and lightweight convolutional network, whose training is efficient and independent of MLLMs. Extensive experiments on 13 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the remarkable effectiveness and strong generalization of our approach. Additionally, our new compression paradigm achieves faster inference with reductions in both prefilling time and KV cache memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that task-related visual token compression in MLLMs is feasible at the LLM input stage (rather than intermediate layers) by using explainability methods to score each visual token's global importance w.r.t. the instruction; a lightweight CNN is trained to map first-layer attention maps to these scores, enabling compression without full inference or LLM modification. Experiments on 13 image/video benchmarks across Qwen2-VL, LLaVA-OneVision, and VILA1.5 report negligible performance loss together with reduced prefilling time and KV-cache memory.

Significance. If the first-layer mapping reliably recovers instruction-conditioned importance, the approach would provide a practical, model-agnostic route to early reduction of irrelevant visual tokens, lowering both compute and memory without architectural changes. The separation of the lightweight predictor from the MLLM and the reported cross-model generalization are concrete strengths that could influence efficiency work in multimodal systems.

major comments (2)
  1. [§3 (Method)] the claim that first-layer attention maps suffice to predict explainability-derived global token importance is load-bearing for the 'negligible performance loss' result. Prior literature (including the paper's own contrast with intermediate-layer methods) indicates that vision-language alignment and instruction relevance typically strengthen in deeper layers; if the learned CNN mapping fails to recover deeper task-conditioned importance, selected tokens will include irrelevant ones or drop relevant ones, directly undermining the central efficiency claim.
  2. [§4 (Experiments)] the reported positive results on 13 benchmarks lack explicit comparison to strong input-stage baselines, per-benchmark compression ratios, statistical significance tests, and ablation on the CNN architecture or training data. Without these, it is unclear whether the observed 'negligible' degradation is robust or partly attributable to post-hoc threshold selection.
minor comments (2)
  1. [Abstract] 'architechtures' is a typo and should read 'architectures'.
  2. [Abstract and §3] the phrase 'with proper selection' is underspecified; the exact importance threshold or top-k rule used for compression should be stated explicitly and held fixed across models.
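
The distinction raised in the second minor comment matters in practice, as the following sketch suggests. Both rules, the function names, and the toy scores are illustrative assumptions; the page does not state which rule the paper uses.

```python
import math

def keep_by_threshold(scores, threshold):
    """Keep every token whose score reaches the threshold (count varies per input)."""
    return [i for i, s in enumerate(scores) if s >= threshold]

def keep_top_k(scores, keep_ratio):
    """Keep a fixed fraction of tokens, whatever their absolute scores."""
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])

# Two toy inputs with different score distributions (illustrative values only).
sample_a = [0.9, 0.8, 0.7, 0.6]   # most tokens look relevant
sample_b = [0.9, 0.1, 0.1, 0.1]   # a single dominant token

for name, scores in [("a", sample_a), ("b", sample_b)]:
    by_threshold = keep_by_threshold(scores, threshold=0.5)
    by_top_k = keep_top_k(scores, keep_ratio=0.5)
    print(name, len(by_threshold), len(by_top_k))
```

A fixed threshold keeps 4 and 1 tokens for the two inputs, while a fixed ratio keeps 2 in both cases, so the reported compression ratio only has a clear meaning once the rule is stated.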

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below and indicate the changes planned for the revised manuscript.

point-by-point responses
  1. Referee: [§3 (Method)] the claim that first-layer attention maps suffice to predict explainability-derived global token importance is load-bearing for the 'negligible performance loss' result. Prior literature (including the paper's own contrast with intermediate-layer methods) indicates that vision-language alignment and instruction relevance typically strengthen in deeper layers; if the learned CNN mapping fails to recover deeper task-conditioned importance, selected tokens will include irrelevant ones or drop relevant ones, directly undermining the central efficiency claim.

    Authors: We thank the referee for this observation. While deeper layers frequently exhibit stronger alignment, our method first computes global token importance via explainability applied to the complete model response conditioned on the instruction; the CNN is then trained to regress these scores from first-layer attention alone. The empirical results across three MLLMs and 13 benchmarks indicate that this distilled mapping preserves task-relevant tokens sufficiently to yield negligible performance loss. In the revision we expand the discussion in §3 with additional analysis of the correlation between first-layer attention and the explainability scores, together with references to prior work on the utility of early-layer representations for importance estimation. revision: partial

  2. Referee: [§4 (Experiments)] the reported positive results on 13 benchmarks lack explicit comparison to strong input-stage baselines, per-benchmark compression ratios, statistical significance tests, and ablation on the CNN architecture or training data. Without these, it is unclear whether the observed 'negligible' degradation is robust or partly attributable to post-hoc threshold selection.

    Authors: We agree that these additions would improve clarity and robustness. In the revised manuscript we include direct comparisons against strong input-stage baselines (random pruning and first-layer attention thresholding). We now report per-benchmark compression ratios and the average ratio achieved. Statistical significance of performance differences versus the uncompressed baseline is evaluated with paired t-tests. We add ablations varying CNN depth, filter sizes, and the source of training supervision. Threshold selection is clarified as being performed on a held-out validation set to target a prescribed ratio, and we present results across a range of ratios to demonstrate stability. revision: yes
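
The paired t-test mentioned in this response can be sketched with the standard library alone. The benchmark scores below are placeholders invented for the example, not results from the paper, and the function name is hypothetical. The sketch computes only the test statistic and degrees of freedom; a p-value would need a t-distribution CDF, for example from SciPy's `scipy.stats.ttest_rel`.

```python
import math
import statistics

def paired_t_statistic(baseline, compressed):
    """Paired t statistic and degrees of freedom for per-benchmark score pairs."""
    if len(baseline) != len(compressed) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [c - b for b, c in zip(baseline, compressed)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Placeholder per-benchmark scores (NOT from the paper).
baseline = [70.0, 65.0, 80.0, 75.0]
compressed = [69.0, 65.5, 79.0, 74.5]

t_stat, dof = paired_t_statistic(baseline, compressed)
print(round(t_stat, 4), dof)  # -1.4142 3
```

Pairing by benchmark removes between-benchmark variance, which is why it suits a comparison where the same benchmarks are run with and without compression.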

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent explainability and separate lightweight training

full rationale

The paper derives its input-stage compression from established external explainability techniques that compute token importance with respect to the instruction on the full model, followed by training an independent lightweight CNN to map only the first-layer attention to those importance scores. This mapping is learned separately and does not redefine or fit any quantity inside the MLLM itself; downstream performance is measured on external benchmarks across three distinct MLLMs. No step reduces a claimed prediction to a fitted parameter by construction, invokes a self-citation as the sole justification for a uniqueness claim, or renames an empirical pattern as a new derivation. The approach remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard transformer attention properties and the effectiveness of existing explainability methods; no new free parameters, axioms beyond domain assumptions, or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Explainability methods can compute the global importance of visual tokens relative to an instruction in transformer architectures
    Invoked when proposing to use these methods to guide token compression at the input stage.

pith-pipeline@v0.9.0 · 5824 in / 1279 out tokens · 61419 ms · 2026-05-19T10:55:27.489048+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

    cs.CL 2026-01 unverdicted novelty 5.0

    The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 1 Pith paper

  1. [1]

    Gemini: A family of highly capable multimodal models

    Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, et al. Gemini: A family of highly capable multimodal models. arXiv, 2023

  2. [2]

    Qwen-vl: A frontier large vision-language model with versatile abilities

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, et al. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv, 2023

  3. [3]

    Deepseek LLM: scaling open-source language models with longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, et al. Deepseek LLM: scaling open-source language models with longtermism. arXiv, 2024

  4. [4]

    Token merging: Your vit but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, et al. Token merging: Your vit but faster. In ICLR, 2023

  5. [5]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language models are few-shot learners. arXiv, 2020

  6. [6]

    Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers

    Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In ICCV, 2021

  7. [7]

    Transformer interpretability beyond attention visualization

    Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In CVPR, pages 782–791, 2021

  8. [8]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, et al. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024

  9. [9]

    Are we on the right way for evaluating large vision-language models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, et al. Are we on the right way for evaluating large vision-language models? In NeurIPS, 2024

  10. [10]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv, 2024

  11. [11]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv, 2023

  12. [12]

    Xception: Deep learning with depthwise separable convolutions

    François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017

  13. [13]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, et al. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022

  14. [14]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In MM, 2024

  15. [15]

    Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, et al. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. In NeurIPS, 2024

  16. [16]

    MME: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv, 2023

  17. [17]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv, 2024

  18. [18]

    Exploiting behavioral consistence for universal user representation

    Jie Gu, Feng Wang, Qinghui Sun, Zhiquan Ye, et al. Exploiting behavioral consistence for universal user representation. In AAAI, 2021

  19. [19]

    Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data

    Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data. arXiv, 2024

  20. [20]

    Prunevid: Visual token pruning for efficient video large language models

    Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual token pruning for efficient video large language models. arXiv, 2024

  21. [21]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

  22. [22]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, et al. Llava-onevision: Easy visual task transfer. arXiv, 2024

  23. [23]

    Seed-bench: Benchmarking multimodal llms with generative comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, et al. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv, 2023

  24. [24]

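    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023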

  25. [25]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024

  26. [26]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, 2024

  27. [27]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  28. [28]

    NVILA: efficient frontier visual language models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, et al. NVILA: efficient frontier visual language models. arXiv, 2024

  29. [29]

    GPT-4 technical report

    OpenAI. GPT-4 technical report. arXiv, 2023

  30. [30]

    Instruction tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, et al. Instruction tuning with GPT-4. arXiv, 2023

  31. [31]

    Efficiently scaling transformer inference

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, et al. Efficiently scaling transformer inference. In Conf. Mach. Learn. Syst., 2023

  32. [32]

    Fastvid: Dynamic density pruning for fast video large language models

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, et al. Fastvid: Dynamic density pruning for fast video large language models. arXiv, 2025

  33. [33]

    Tempme: Video temporal token merging for efficient text-video retrieval

    Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, et al. Tempme: Video temporal token merging for efficient text-video retrieval. In ICLR, 2025

  34. [34]

    Tokencarve: Information-preserving visual token compression in multimodal large language models

    Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, et al. Tokencarve: Information-preserving visual token compression in multimodal large language models. arXiv, 2025

  35. [35]

    Llama: Open and efficient foundation language models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, et al. Llama: Open and efficient foundation language models. arXiv, 2023

  36. [36]

    Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, et al. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, 2019

  37. [37]

    FOLDER: accelerating multi-modal large language models with enhanced performance

    Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, et al. FOLDER: accelerating multi-modal large language models with enhanced performance. arXiv, 2025

  38. [38]

    Dynamic-vlm: Simple dynamic visual token compression for videollm

    Han Wang, Yuxiang Nie, Yongjie Ye, Guanyu Deng, et al. Dynamic-vlm: Simple dynamic visual token compression for videollm. arXiv, 2024

  39. [39]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv, 2024

  40. [40]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021

  41. [41]

    Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv, 2024

  42. [42]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, et al. Visionzip: Longer is better but not necessary in vision language models. arXiv, 2024

  43. [43]

    Deco: Decoupling token compression from semantic abstraction in multimodal large language models

    Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, et al. Deco: Decoupling token compression from semantic abstraction in multimodal large language models. arXiv, 2024

  44. [44]

    Mm-vet: Evaluating large multimodal models for integrated capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, et al. Mm-vet: Evaluating large multimodal models for integrated capabilities. In ICML, 2024

  45. [45]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, et al. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, 2019

  46. [46]

    Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

    Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv, 2024

  47. [47]

    Sparsevlm: Visual token sparsification for efficient vision-language model inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv, 2024

  48. [48]

    Video instruction tuning with synthetic data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, et al. Video instruction tuning with synthetic data. arXiv, 2024

  49. [49]

    A stitch in time saves nine: Small VLM is a precise guidance for accelerating large vlms

    Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, et al. A stitch in time saves nine: Small VLM is a precise guidance for accelerating large vlms. arXiv, 2024

  50. [50]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, et al. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024

  51. [51]

    Focusllava: A coarse-to-fine approach for efficient and effective visual token compression

    Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, and Sheng Guo. Focusllava: A coarse-to-fine approach for efficient and effective visual token compression. arXiv, 2024

A More Visualization Results

A.1 Visualization Results Across Different MLLMs

We present visualization results for LLaVA-OneVision, Qwen2-VL, and VILA1.5 on both video and image inputs in Fi...