Task-Related Token Compression in Multimodal Large Language Models from an Explainability Perspective

Chu Tang; Jie Gu; Jingmin Chen; Lei Lei; Tong Xu; Xiaokang Ma

arxiv: 2506.01097 · v2 · submitted 2025-06-01 · 💻 cs.CV

Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu

Pith reviewed 2026-05-19 10:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language models · visual token compression · explainability · attention maps · task-related pruning · inference efficiency · model-agnostic design


The pith

Task-related visual token compression works at the input stage of multimodal LLMs when guided by explainability scores, with negligible performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that visual tokens can be pruned for relevance to the user instruction right at the start of LLM processing rather than in intermediate layers. Explainability techniques measure each token's global importance to the task, and a lightweight convolutional network approximates those scores from the first layer's attention map alone. This early selection removes many task-irrelevant tokens while preserving model output quality. The approach requires no changes to the underlying LLM and trains independently of it. Tests on thirteen image and video benchmarks with three different MLLMs show reduced computation, shorter prefilling times, and smaller memory use during inference.

Core claim

Explainability methods for transformer architectures can evaluate the global importance of each visual token with respect to a given instruction; this importance can be learned as a mapping from the first LLM layer's attention map by a simple lightweight convolutional network, enabling effective task-related token compression directly at the LLM input stage without full inference or any modification to the model architecture.
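
A minimal sketch of the input-stage selection step described in the core claim, assuming per-token importance scores are already available (from the explainability method or from the learned predictor). The names `select_visual_tokens`, `compress`, and `keep_ratio` are illustrative and do not come from the paper; the 0.5 default only mirrors the 50% retention shown in Figure 2, and the paper's exact selection rule may differ.

```python
def select_visual_tokens(scores, keep_ratio=0.5):
    """Return the indices of the visual tokens to keep, in original order.

    scores: one importance score per visual token (higher = more relevant).
    keep_ratio: hypothetical fraction of tokens to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by descending score; sorted() is stable, so ties
    # are resolved in favour of the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore the original ordering of the retained tokens.
    return sorted(ranked[:k])


def compress(tokens, scores, keep_ratio=0.5):
    """Drop low-scoring visual tokens before they reach the LLM."""
    keep = select_visual_tokens(scores, keep_ratio)
    return [tokens[i] for i in keep]
```

Returning the retained indices in their original order is a design choice in this sketch, made so that the shortened visual sequence keeps its spatial or temporal ordering.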

What carries the argument

Lightweight convolutional network that maps first-layer attention maps to global token importance scores obtained from explainability methods.
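
To give the idea a concrete shape, here is a rough, untrained sketch of what "a convolution over a first-layer attention map" could look like. Everything in it is an assumption made for illustration: the input is taken to be attention from instruction tokens to visual tokens, averaged over instruction tokens and reshaped to a `height` × `width` grid; a single hand-supplied kernel stands in for the learned network; and the names `attention_to_grid`, `conv2d_same`, and `predict_scores` are hypothetical. The paper's actual architecture and input layout are not specified on this page.

```python
def attention_to_grid(attn_text_to_visual, height, width):
    """Average attention rows (one per instruction token) into an H x W grid."""
    n = height * width
    if any(len(row) != n for row in attn_text_to_visual):
        raise ValueError("each attention row must have height * width entries")
    rows = len(attn_text_to_visual)
    mean = [sum(row[j] for row in attn_text_to_visual) / rows for j in range(n)]
    return [mean[r * width:(r + 1) * width] for r in range(height)]


def conv2d_same(grid, kernel, bias=0.0):
    """Single-channel 2D cross-correlation with zero padding (odd kernel size)."""
    h, w = len(grid), len(grid[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = bias
            for dy in range(kh):
                for dx in range(kw):
                    yy, xx = y + dy - ph, x + dx - pw
                    if 0 <= yy < h and 0 <= xx < w:  # outside the grid counts as zero
                        acc += kernel[dy][dx] * grid[yy][xx]
            out[y][x] = acc
    return out


def predict_scores(attn_text_to_visual, height, width, kernel):
    """Map first-layer attention to one non-negative score per visual token."""
    grid = attention_to_grid(attn_text_to_visual, height, width)
    feat = conv2d_same(grid, kernel)
    # ReLU, then flatten back to the visual-token order.
    return [max(0.0, v) for row in feat for v in row]
```

In a trained version the kernel weights would be fitted so that the output regresses the explainability scores; here the kernel is simply passed in by the caller.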

If this is right

Fewer visual tokens reach the LLM, directly lowering computational costs during both training and inference.
Prefilling time decreases and KV cache memory usage shrinks because irrelevant tokens are removed early.
The method applies to existing MLLMs without any architecture changes or retraining of the core model.
Performance holds across image and video tasks on multiple leading MLLMs, showing broad applicability.
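
The memory point can be made concrete with the standard size of a transformer KV cache: one key tensor and one value tensor per layer, each holding `num_kv_heads * head_dim` values per token. The sketch below is illustrative only; the function names and any configuration values passed to them are hypothetical and are not taken from the three MLLMs evaluated in the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def kv_cache_saving(num_text, num_visual, keep_ratio, **model):
    """Fraction of KV-cache memory saved when only keep_ratio of the
    visual tokens enters the LLM (text tokens are left untouched)."""
    full = kv_cache_bytes(num_text + num_visual, **model)
    kept = kv_cache_bytes(num_text + int(num_visual * keep_ratio), **model)
    return 1.0 - kept / full
```

Because the cache grows linearly with sequence length in every layer, tokens removed before the first layer are absent from the cache of all layers, whereas tokens pruned at an intermediate layer still occupy the cache of the layers before it.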

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same early-attention mapping idea could be tested on audio or text tokens in other multimodal settings.
Hardware-constrained deployments might gain larger effective context windows by adopting this pruning step.
Combining the approach with later-stage optimizations like quantization could compound efficiency gains.

Load-bearing premise

Explainability methods can accurately assess the global importance of individual visual tokens relative to an instruction, and this importance is reliably predictable from first-layer attention maps alone.

What would settle it

Substantial drops in accuracy on standard image or video benchmarks when the compressed tokens are used, or large mismatches between the lightweight network outputs and full explainability importance scores.

Figures

Figures reproduced from arXiv: 2506.01097 by Chu Tang, Jie Gu, Jingmin Chen, Lei Lei, Tong Xu, Xiaokang Ma.

**Figure 1.** Overview of our method. The top portion illustrates the details of our explainability-based compression approach: an explainability method can reveal the important visual tokens (first row, Section 3.2); a lightweight model can then be trained to approximate this explainability and serve as a compression indicator (second row, Section 3.3). The bottom portion shows a general inference framework for MLLMs,…

**Figure 2.** Visualization of $R_v$ obtained via the explainability method (left) and the corresponding token pruning results (right). Based on $R_v$, the top 50% of visual tokens are retained, while the remaining 50% are pruned (masked in white). All three MLLMs generate the correct answer using only the retained tokens.

**Figure 3.** Comparison results on larger images and longer videos. Performance preservation ratio denotes the proportion of the performance retained relative to the Vanilla model. The average retention ratio refers to the mean proportion of retained tokens across all LLM layers. The first two sub-figures illustrate the average performance preservation of MLLMs across image and video benchmarks. The last two sub-figure…

**Figure 4.** Video Input Visualizations for LLaVA-OneVision.

**Figure 5.** Image Input Visualizations for LLaVA-OneVision.

**Figure 6.** Video Input Visualizations for Qwen2-VL.

**Figure 7.** Image Input Visualizations for Qwen2-VL.

A.2 Case Study: Explainability Reveals Instruction-Related Visual Tokens. To demonstrate the effectiveness of explainability methods in identifying visual tokens that are highly relevant to user instructions, we present two case studies covering both video and image inputs. Given the same input $V$, the explainability method generates visual relevance scores $R_v$ that s…

**Figure 8.** Video Input Visualizations for VILA.

B Details of Data for Training $f_\theta$. We train our explainability-based compressor based on subsets sampled from high-quality open-source datasets. First, the details of the sampling are as follows. Image Dataset: for training the compressor used in image tasks, we sample a subset of Infinity-MM that ensures high quality and diversity. The training set primarily consists of…

**Figure 9.** Case Study 1.

**Figure 10.** Case Study 2.

D Efficiency Analysis in Inference. To evaluate computational efficiency during inference, we follow FastV and PyramidDrop and report the FLOPs of the visual token part. Specifically, we consider the FLOPs of the multi-head attention and the feed-forward network (FFN) modules as:

$$\text{FLOPs}_{\text{layer}} = 4nd^2 + 2n^2 d + lnm, \tag{4}$$

where $n$ is the number of visual tokens, $d$ is the hidden state size, $m$ is the …
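
The per-layer FLOPs expression in the efficiency excerpt above (4nd² + 2n²d + lnm) can be evaluated directly. The sketch below transcribes it term by term; note that the excerpt is truncated before `m` and `l` are defined, so both are treated here as plain caller-supplied parameters, and the helper names `layer_flops` and `flops_ratio` are hypothetical.

```python
def layer_flops(n, d, m, l):
    """Per-layer FLOPs of the visual-token part, transcribed from the excerpt:
    4*n*d^2 + 2*n^2*d + l*n*m.

    n: number of visual tokens; d: hidden state size.
    m, l: not defined in the truncated excerpt; passed through as given.
    """
    return 4 * n * d**2 + 2 * n**2 * d + l * n * m


def flops_ratio(n_full, keep_ratio, d, m, l):
    """FLOPs with only keep_ratio of the visual tokens, relative to all of them."""
    n_kept = int(n_full * keep_ratio)
    return layer_flops(n_kept, d, m, l) / layer_flops(n_full, d, m, l)
```

Since the middle term is quadratic in `n` while the other two are linear, keeping half of the visual tokens leaves strictly less than half of the per-layer FLOPs under this formula.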

Original abstract

Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Instruction-related visual token compression demonstrates strong task relevance, which aligns well with MLLMs ultimate goal of instruction following. Previous works generally assume that visual tokens achieve better vision-language alignment in the shallow layers of LLMs, which have led to task-related token compression being primarily applied in intermediate LLM layers. In contrast, our study reveals that with proper selection, task-related token compression is feasible at the input stage of LLM with negligible performance loss. This new paradigm significantly reduces task-irrelevant visual tokens and its model-agnostic design enables application without modifying the LLM architecture. Specifically, we suggest that explainability methods for transformer-based architechtures can evaluate the global importance of each visual token with respect to the given instruction, which can effectively guide the task-related token compression for MLLMs. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass. Interestingly, this mapping can be learned using a simple and lightweight convolutional network, whose training is efficient and independent of MLLMs. Extensive experiments on 13 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the remarkable effectiveness and strong generalization of our approach. Additionally, our new compression paradigm achieves faster inference with reductions in both prefilling time and KV cache memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper moves token compression to the input stage using explainability scores predicted from first-layer attention, with experiments showing small losses across three MLLMs.


The main thing to know is that this work claims task-related visual token compression works at the LLM input stage, not just in intermediate layers, by guiding selection with explainability and learning a lightweight CNN mapper from first-layer attention maps. They report this keeps performance nearly intact while cutting prefilling time and KV cache on image and video tasks. The model-agnostic angle, where nothing inside the LLM changes, is the practical hook.

Experiments run on Qwen2-VL, LLaVA-OneVision, and VILA1.5 across 13 benchmarks, which gives the results some breadth. That scope is the clearest strength; the numbers on speed and memory look consistent enough to take seriously if the baselines are fair. The shift away from intermediate-layer assumptions is also a clear difference from the cited prior work.

The soft spot sits in the core assumption that first-layer attention plus a simple convolutional mapper can recover instruction-specific global importance. Most vision-language alignment builds later, so the first layer tends to capture lower-level patterns. If the mapping mostly picks salient visual regions rather than truly task-conditioned ones, the negligible-loss claim could be narrower than presented. I would want to see more on whether the selected tokens align with human or deeper-model judgments of relevance, and how sensitive results are to the choice of explainability method. The training of the mapper is independent and cheap, which helps, but any post-hoc tuning in the reported runs would need checking.

This is for groups working on efficient multimodal inference and token pruning. Readers who care about early-stage compression or attention-based selection will find usable ideas here. The empirical coverage is wide enough that it deserves a serious referee even if the first-layer justification needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that task-related visual token compression in MLLMs is feasible at the LLM input stage (rather than intermediate layers) by using explainability methods to score each visual token's global importance w.r.t. the instruction; a lightweight CNN is trained to map first-layer attention maps to these scores, enabling compression without full inference or LLM modification. Experiments on 13 image/video benchmarks across Qwen2-VL, LLaVA-OneVision, and VILA1.5 report negligible performance loss together with reduced prefilling time and KV-cache memory.

Significance. If the first-layer mapping reliably recovers instruction-conditioned importance, the approach would provide a practical, model-agnostic route to early reduction of irrelevant visual tokens, lowering both compute and memory without architectural changes. The separation of the lightweight predictor from the MLLM and the reported cross-model generalization are concrete strengths that could influence efficiency work in multimodal systems.

major comments (2)

[§3 (Method)] the claim that first-layer attention maps suffice to predict explainability-derived global token importance is load-bearing for the 'negligible performance loss' result. Prior literature (including the paper's own contrast with intermediate-layer methods) indicates that vision-language alignment and instruction relevance typically strengthen in deeper layers; if the learned CNN mapping fails to recover deeper task-conditioned importance, selected tokens will include irrelevant ones or drop relevant ones, directly undermining the central efficiency claim.
[§4 (Experiments)] the reported positive results on 13 benchmarks lack explicit comparison to strong input-stage baselines, per-benchmark compression ratios, statistical significance tests, and ablation on the CNN architecture or training data. Without these, it is unclear whether the observed 'negligible' degradation is robust or partly attributable to post-hoc threshold selection.

minor comments (2)

[Abstract] 'architechtures' is a typo and should read 'architectures'.
[Abstract and §3] the phrase 'with proper selection' is underspecified; the exact importance threshold or top-k rule used for compression should be stated explicitly and held fixed across models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below and indicate the changes planned for the revised manuscript.


Referee: [§3 (Method)] the claim that first-layer attention maps suffice to predict explainability-derived global token importance is load-bearing for the 'negligible performance loss' result. Prior literature (including the paper's own contrast with intermediate-layer methods) indicates that vision-language alignment and instruction relevance typically strengthen in deeper layers; if the learned CNN mapping fails to recover deeper task-conditioned importance, selected tokens will include irrelevant ones or drop relevant ones, directly undermining the central efficiency claim.

Authors: We thank the referee for this observation. While deeper layers frequently exhibit stronger alignment, our method first computes global token importance via explainability applied to the complete model response conditioned on the instruction; the CNN is then trained to regress these scores from first-layer attention alone. The empirical results across three MLLMs and 13 benchmarks indicate that this distilled mapping preserves task-relevant tokens sufficiently to yield negligible performance loss. In the revision we expand the discussion in §3 with additional analysis of the correlation between first-layer attention and the explainability scores, together with references to prior work on the utility of early-layer representations for importance estimation. revision: partial
Referee: [§4 (Experiments)] the reported positive results on 13 benchmarks lack explicit comparison to strong input-stage baselines, per-benchmark compression ratios, statistical significance tests, and ablation on the CNN architecture or training data. Without these, it is unclear whether the observed 'negligible' degradation is robust or partly attributable to post-hoc threshold selection.

Authors: We agree that these additions would improve clarity and robustness. In the revised manuscript we include direct comparisons against strong input-stage baselines (random pruning and first-layer attention thresholding). We now report per-benchmark compression ratios and the average ratio achieved. Statistical significance of performance differences versus the uncompressed baseline is evaluated with paired t-tests. We add ablations varying CNN depth, filter sizes, and the source of training supervision. Threshold selection is clarified as being performed on a held-out validation set to target a prescribed ratio, and we present results across a range of ratios to demonstrate stability. revision: yes
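
For readers unfamiliar with the test named in the rebuttal, the following is a minimal sketch of a paired t-statistic over per-benchmark scores, using only the Python standard library. It illustrates the generic statistical procedure, not the authors' evaluation code; the function name `paired_t_statistic` and its arguments are hypothetical, and no scores from the paper are reproduced here.

```python
import math
import statistics


def paired_t_statistic(baseline, compressed):
    """Return (t, degrees_of_freedom) for paired scores.

    baseline[i] and compressed[i] must come from the same benchmark, e.g. the
    uncompressed model and the token-compressed model respectively.
    """
    if len(baseline) != len(compressed):
        raise ValueError("score lists must be paired one-to-one")
    diffs = [c - b for b, c in zip(baseline, compressed)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

The returned statistic would then be compared against a Student's t distribution with the returned degrees of freedom; the standard library has no t-distribution CDF, so that step is left to a statistics package.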

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent explainability and separate lightweight training


The paper derives its input-stage compression from established external explainability techniques that compute token importance with respect to the instruction on the full model, followed by training an independent lightweight CNN to map only the first-layer attention to those importance scores. This mapping is learned separately and does not redefine or fit any quantity inside the MLLM itself; downstream performance is measured on external benchmarks across three distinct MLLMs. No step reduces a claimed prediction to a fitted parameter by construction, invokes a self-citation as the sole justification for a uniqueness claim, or renames an empirical pattern as a new derivation. The approach remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard transformer attention properties and the effectiveness of existing explainability methods; no new free parameters, axioms beyond domain assumptions, or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Explainability methods can compute the global importance of visual tokens relative to an instruction in transformer architectures.
Invoked when proposing to use these methods to guide token compression at the input stage.

pith-pipeline@v0.9.0 · 5824 in / 1279 out tokens · 61419 ms · 2026-05-19T10:55:27.489048+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


explainability methods ... iteratively update a relevance map across layers using gradient-weighted multi-head attentions ... $R_t = R_t + \mathbb{E}_h\big(A^{l}_{t} \odot \nabla A^{l}_{t}\big) \cdot R_t$
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear


learn a mapping from the attention map of the first LLM layer to the explanation results ... simple and lightweight convolutional network
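
The relevance update quoted in the first passage above (R ← R + E_h(A ⊙ ∇A) · R, applied layer by layer) can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's code: attention maps and their gradients are assumed to be given as nested lists of shape `[heads][n][n]`, the relevance map starts as the identity, and the optional clamp to non-negative values follows Chefer et al. (ICCV 2021) and is not visible in the quoted passage. All function names are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def head_averaged_grad_attention(attn, grad, clamp=True):
    """Compute E_h(A * grad_A) over heads, optionally clamping negatives to 0."""
    heads, n = len(attn), len(attn[0])
    out = [[0.0] * n for _ in range(n)]
    for h in range(heads):
        for i in range(n):
            for j in range(n):
                v = attn[h][i][j] * grad[h][i][j]  # element-wise product
                if clamp:
                    v = max(v, 0.0)
                out[i][j] += v / heads
    return out


def relevance_rollout(attns, grads, clamp=True):
    """Apply R <- R + A_bar @ R for each layer, starting from R = I.

    attns, grads: per-layer lists, each entry shaped [heads][n][n].
    """
    n = len(attns[0][0])
    R = identity(n)
    for attn, grad in zip(attns, grads):
        a_bar = head_averaged_grad_attention(attn, grad, clamp)
        update = matmul(a_bar, R)
        R = [[R[i][j] + update[i][j] for j in range(n)] for i in range(n)]
    return R
```

Obtaining the gradients requires a backward pass through the full model, which is why the paper proposes a cheap predictor on first-layer attention instead of running this procedure at inference time.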

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
cs.CL 2026-01 unverdicted novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

