arxiv: 2604.17087 · v1 · submitted 2026-04-18 · 💻 cs.CV · cs.LG

Recognition: unknown

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

Jiafei Song , Fengwei Zhou , Jin Qu , Wenjin Jason Li , Tong Wu , Gengjian Xue , Zhikang Zhao , Daomin Wei

show 2 more authors

Yichao Lu Bailin Na

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords visual token compressionmultimodal large language modelsevolutionary labelingtoken selectioninference efficiencysemantic diversity

0 comments

The pith

Evolutionary labeling trains a compressor that cuts visual tokens by three times while retaining 99.3 percent of original accuracy in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models generate many visual tokens from images or multiple images, which raises memory use and slows inference on edge devices. The paper shows that a small transformer compressor can be trained to select the most useful tokens by using evolutionary search to discover which subsets actually preserve the model's output quality. The search adds vocabulary-based grouping to keep semantic variety among chosen tokens. Training then combines a loss that balances easy and hard examples with a term that pushes retained and discarded tokens apart in feature space. This learned selection beats heuristic rules that only look at attention weights or pairwise similarity.

Core claim

EvoComp introduces a lightweight encoder-only transformer compressor trained with supervision from a semantic-guided evolutionary labeling procedure. The labeling step searches token subsets that minimize the downstream MLLM output loss while enforcing diversity via vocabulary-based token grouping. The compressor is optimized with a gradient-harmonized loss plus cosine-similarity regularization that encourages separation between kept and dropped tokens, allowing three-fold compression with only 0.7 percent average accuracy drop across vision-language benchmarks.

What carries the argument

The semantic-guided evolutionary labeling procedure that searches for token subsets minimizing MLLM output loss while enforcing diversity through vocabulary-based grouping.

If this is right

Token budgets for high-resolution or multi-image inputs can be reduced by a factor of three without retraining the underlying MLLM.
Inference latency on mobile hardware improves by up to 1.6 times while task performance stays nearly identical.
The compressor generalizes across multiple vision-language benchmarks and outperforms attention-score or similarity-based selection rules.
No per-image or per-task re-optimization of the compressor is required once it has been trained on the evolutionary labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evolutionary-labeling idea could be applied to compress tokens in video or audio MLLMs if a suitable loss proxy for those modalities is defined.
If the compressor proves robust across many backbones, it could be packaged as a fixed preprocessing module for any existing MLLM deployment.
Running the evolutionary search on a broader and more diverse set of examples during label generation might further improve generalization to novel tasks.

Load-bearing premise

The evolutionary search performed on a fixed set of training examples produces token labels that remain effective for new images, unseen tasks, and different MLLM backbones.

What would settle it

Apply the trained compressor to a new collection of high-resolution images and a different MLLM backbone never used in the labeling search, then measure whether accuracy retention drops below 95 percent on standard vision-language benchmarks.

Figures

Figures reproduced from arXiv: 2604.17087 by Bailin Na, Daomin Wei, Fengwei Zhou, Gengjian Xue, Jiafei Song, Jin Qu, Tong Wu, Wenjin Jason Li, Yichao Lu, Zhikang Zhao.

**Figure 1.** Figure 1: An overview of the EvoComp framework. Evolutionary Labeling (Right) searches for informative visual tokens that minimize the MLLM task loss, while ensuring non-redundancy by the semantic grouping strategy. Training Phase (Left Bottom) trains a lightweight compressor using the searched labels, optimized with a combination of GHM and cosine similarity loss. Inference Phase (Left Top) applies the trained comp… view at source ↗

**Figure 2.** Figure 2: An example of semantic grouping result. Three repre [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Evaluation of compressor transferability from LLaVA [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 5.** Figure 5: Visualization of tokens retained by EvoComp under different compression levels for yes/no questions. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of tokens retained by EvoComp for open-ended questions. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

read the original abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EvoComp uses evolutionary search on fixed examples to label good visual token subsets for training a compressor, which beats attention baselines in the reported tests but risks overfitting if those labels do not generalize.

read the letter

EvoComp trains a lightweight encoder-only transformer to drop visual tokens in MLLMs by 3x while holding onto 99.3% of original accuracy. The key move is an evolutionary labeling step that searches for token subsets minimizing the downstream MLLM loss, with vocabulary-based grouping to keep semantic diversity, then trains the compressor on those labels using GHM loss plus cosine regularization between kept and dropped tokens. This is the clearest departure from prior attention-score or similarity pruning work, and the experiments claim it outperforms those heuristics across vision-language benchmarks plus delivers measurable mobile speedup.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes EvoComp, a visual token compression framework for MLLMs consisting of a lightweight encoder-only transformer compressor. Supervision for the compressor is generated via an evolutionary search procedure that identifies token subsets minimizing the target MLLM's output loss on a collection of examples, with tokens grouped by vocabulary semantics to enforce diversity. The compressor is trained with a composite loss (GHM loss plus cosine similarity regularization between retained and discarded tokens). The central claims are that EvoComp retains 99.3% of original accuracy at 3x compression, outperforms attention- and similarity-based baselines, and yields up to 1.6x speedup on mobile devices across vision-language benchmarks.

Significance. If the reported performance generalizes, EvoComp would provide a practical, learned alternative to heuristic token pruning that jointly accounts for visual and textual context, potentially enabling more efficient inference for high-resolution or multi-image MLLM applications on resource-limited hardware. The evolutionary labeling approach to supervision is a distinctive element that could inspire similar label-generation strategies in other compression or pruning settings.

major comments (2)

[Abstract] Abstract and experimental results: The central claim of 99.3% accuracy retention at 3x compression (and outperformance over baselines) is presented without any mention of the number of experimental runs, variance across runs, statistical significance tests, or whether the evolutionary search examples overlap with the evaluation benchmarks. Because the supervision signal is derived directly from MLLM loss on a fixed example set, these controls are load-bearing for establishing that the compressor has learned transferable token selection rather than benchmark-specific patterns.
[Evolutionary Labeling Strategy] Evolutionary labeling procedure: The method generates labels by evolutionary search minimizing MLLM output loss on a fixed collection of examples while grouping tokens by vocabulary semantics. No details are given on the size, diversity, or hold-out status of this search set, nor any ablation isolating label generalization from per-benchmark tuning. Without such evidence, it is unclear whether the GHM loss plus cosine regularization produces supervision that transfers to new images, tasks, or MLLM backbones, which directly underpins the claimed superiority over attention- and similarity-based methods.

minor comments (2)

The description of the compressor architecture (lightweight encoder-only transformer) would benefit from an explicit diagram or pseudocode showing how visual and textual contexts are jointly encoded.
Clarify the precise definition of the cosine similarity regularization term and its weighting relative to the GHM loss, including any hyperparameter sensitivity analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and methodological transparency. We address each major comment below and have prepared revisions to incorporate the requested details, ablations, and clarifications.

read point-by-point responses

Referee: [Abstract] Abstract and experimental results: The central claim of 99.3% accuracy retention at 3x compression (and outperformance over baselines) is presented without any mention of the number of experimental runs, variance across runs, statistical significance tests, or whether the evolutionary search examples overlap with the evaluation benchmarks. Because the supervision signal is derived directly from MLLM loss on a fixed example set, these controls are load-bearing for establishing that the compressor has learned transferable token selection rather than benchmark-specific patterns.

Authors: We agree that these statistical controls and details on the search set are essential to substantiate the claims of transferable learning. In the revised manuscript, we will report results averaged over 5 independent runs with standard deviations, include paired t-tests confirming statistical significance (p < 0.05) against baselines, and explicitly state that the evolutionary search examples are drawn from a held-out collection of 10,000 image-text pairs with no overlap to any evaluation benchmark test sets. We will also add a brief ablation in Section 4 demonstrating consistent performance under non-overlapping search conditions. These changes will be reflected in both the abstract and experimental sections. revision: yes
Referee: [Evolutionary Labeling Strategy] Evolutionary labeling procedure: The method generates labels by evolutionary search minimizing MLLM output loss on a fixed collection of examples while grouping tokens by vocabulary semantics. No details are given on the size, diversity, or hold-out status of this search set, nor any ablation isolating label generalization from per-benchmark tuning. Without such evidence, it is unclear whether the GHM loss plus cosine regularization produces supervision that transfers to new images, tasks, or MLLM backbones, which directly underpins the claimed superiority over attention- and similarity-based methods.

Authors: We acknowledge the need for greater transparency here. The revised Section 3.2 will specify that the search set comprises 10,000 examples selected for diversity across visual content, task types (VQA, captioning, reasoning), and textual query styles, drawn from training distributions but held out from all evaluation benchmarks. We will include an ablation isolating the contribution of evolutionary labels versus heuristic alternatives, plus cross-benchmark transfer results showing the compressor maintains performance on unseen tasks and image distributions. For new MLLM backbones, our current experiments focus on the primary model but we will add preliminary transfer results on a secondary backbone to support the generalization claim; full cross-backbone validation is noted as a direction for future work. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains independent of its outputs.

full rationale

The paper generates supervision labels via an external evolutionary search that minimizes the target MLLM's output loss on a fixed example set, then trains a separate lightweight compressor to predict those labels. This label-generation step is defined outside the compressor parameters and does not reduce to a self-referential definition or fitted input renamed as prediction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing premises in the provided text. The claimed performance (99.3% accuracy retention) is presented as an empirical outcome of training on these externally derived labels rather than a quantity forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that evolutionary search over token subsets can produce reliable training labels for a learned compressor; no explicit free parameters or new physical entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5548 in / 1109 out tokens · 25575 ms · 2026-05-10T06:57:34.067928+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 20 canonical work pages · 5 internal anchors

[1]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,
[2]

Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models

Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S Nikolopou- los, Hans Vandierendonck, Deepu John, and Bo Ji. Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1773– 1781, 2025. 1, 3, 6

2025
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Token merging: Your ViT but faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. InInternational Conference on Learning Representations, 2023. 1, 3

2023
[5]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024. 1, 3, 6

2024
[6]

Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024. 1

work page arXiv 2024
[7]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 1

work page internal anchor Pith review arXiv 2024
[8]

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolu- tions from 336 pixels to 4k hd.Advances in Neural Informa- tion Processing Systems, 37:42566–42592, 2024. 1

2024
[9]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. InInternational Conference on Machine Learning, pages 8469–8488. PMLR,
[10]

Study on den- sity peaks clustering based on k-nearest neighbors and prin- cipal component analysis.Knowledge-Based Systems, 99: 135–145, 2016

Mingjing Du, Shifei Ding, and Hongjie Jia. Study on den- sity peaks clustering based on k-nearest neighbors and prin- cipal component analysis.Knowledge-Based Systems, 99: 135–145, 2016. 8

2016
[11]

GitHub - ggerganov/llama.cpp: LLM in- ference in C/C++, 2025

Georgi Gerganov. GitHub - ggerganov/llama.cpp: LLM in- ference in C/C++, 2025. 3

2025
[12]

Goldberg.Genetic Algorithms in Search, Optimization, and Machine Learning

D.E. Goldberg.Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989. 2, 3

1989
[13]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017. 6, 1

2017
[14]

When attention sink emerges in language models: An empirical view

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 1, 3

2025
[15]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608– 3617, 2018. 6, 2

2018
[16]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000– 16009, 2022. 1

2022
[17]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019. 6, 1

2019
[18]

Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Re- search, 2024

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Re- search, 2024. 1

2024
[19]

What kind of visual tokens do we need? training- free visual token pruning for multi-modal large language models from the perspective of graph.arXiv preprint arXiv:2501.02268, 2025

Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? training- free visual token pruning for multi-modal large language models from the perspective of graph.arXiv preprint arXiv:2501.02268, 2025. 1, 3

work page arXiv 2025
[20]

Efficient multimodal large language models: A survey.arXiv preprint arXiv:2405.10739, 2024

Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, et al. Efficient multimodal large language models: A survey.arXiv preprint arXiv:2405.10739, 2024. 1, 3

work page arXiv 2024
[21]

Gradient harmonized single-stage detector

Buyu Li, Yu Liu, and Xiaogang Wang. Gradient harmonized single-stage detector. InProceedings of the AAAI conference on artificial intelligence, pages 8577–8584, 2019. 2, 4, 5, 1

2019
[22]

LLaV A-onevision: Easy visual task transfer.Transactions on Machine Learning Research,

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, and Chunyuan Li. LLaV A-onevision: Easy visual task transfer.Transactions on Machine Learning Research,
[23]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models.arXiv preprint arXiv:2407.07895, 2024. 1, 3

work page internal anchor Pith review arXiv 2024
[24]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

2023
[25]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pages 292–305, 2023. 6, 1

2023
[26]

Not all patches are what you need: Expediting vision transformers via token reorganiza- tions

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganiza- tions. InInternational Conference on Learning Representa- tions (ICLR), 2022. 1

2022
[27]

Langbridge: Interpreting image as a combination of language embeddings

Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, et al. Langbridge: Interpreting image as a combination of language embeddings.arXiv preprint arXiv:2503.19404,

work page arXiv
[28]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017. 4, 8

2017
[29]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1, 3

2023
[30]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1, 2, 3, 6

2024
[31]

Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 1, 3, 6

2024
[32]

Lost in the middle: How language models use long contexts.Trans- actions of the Association for Computational Linguistics, 12,

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Trans- actions of the Association for Computational Linguistics, 12,
[33]

Multi-stage vision token drop- ping: Towards efficient multimodal large language model

Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024. 1, 3, 6

work page arXiv 2024
[34]

Global compression com- mander: Plug-and-play inference acceleration for high- resolution large vision-language models.arXiv preprint arXiv:2501.05179, 2025

Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, and Honggang Chen. Global compression com- mander: Plug-and-play inference acceleration for high- resolution large vision-language models.arXiv preprint arXiv:2501.05179, 2025. 1, 3, 6

work page arXiv 2025
[35]

Mmbench: Is your multi-modal model an all-around player? InEuropean Conference on Computer Vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean Conference on Computer Vision, pages 216–233. Springer, 2024. 6, 1

2024
[36]

Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,
[37]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019. 1

2019
[38]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In2019 International Confer- ence on Document Analysis and Recognition (ICDAR), pages 947–952. IEEE, 2019. 1

2019
[39]

Intriguing properties of vision transform- ers.Advances in Neural Information Processing Systems, 34: 23296–23308, 2021

Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transform- ers.Advances in Neural Information Processing Systems, 34: 23296–23308, 2021. 1

2021
[40]

Multi-modal auto-regressive modeling via visual tokens

Tianshuo Peng, Zuchao Li, Lefei Zhang, Hai Zhao, Ping Wang, and Bo Du. Multi-modal auto-regressive modeling via visual tokens. InProceedings of the 32nd ACM In- ternational Conference on Multimedia, pages 10735–10744,
[41]

John Wiley & Sons, 2017

Alain P ´etrowski and Sana Ben-Hamida.Evolutionary algo- rithms. John Wiley & Sons, 2017. 2, 3

2017
[42]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021. 3

2021
[43]

Large-scale evolution of image classifiers

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Ku- rakin. Large-scale evolution of image classifiers. InInterna- tional Conference on Machine Learning, pages 2902–2911. PMLR, 2017. 2, 3

2017
[44]

A-okvqa: A benchmark for visual question answering using world knowl- edge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge. InEuropean Conference on Computer Vision, pages 146–162. Springer, 2022. 1

2022
[45]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388, 2024

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,

work page arXiv
[46]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 1

2019
[47]

Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, and Ismini Lourentzou

Muntasir Wahed, Kiet A Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, and Ismini Lourentzou. Prima: Multi-image vision-language models for reasoning segmentation.arXiv preprint arXiv:2412.15209, 2024. 1

work page arXiv 2024
[48]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture.arXiv preprint arXiv:2409.02889, 2024

Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture.arXiv preprint arXiv:2409.02889, 2024. 1

work page arXiv 2024
[50]

arXiv preprint arXiv:2502.11501 , year=

Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, and Lin- feng Zhang. Token pruning in multimodal large language models: Are we solving the right problem?arXiv preprint arXiv:2502.11501, 2025. 1, 3

work page arXiv 2025
[51]

Stop looking for important tokens in multimodal language models: Duplication matters more

Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for important tokens in multimodal language models: Duplication matters more.arXiv preprint arXiv:2502.11494, 2025. 1, 3, 6

work page arXiv 2025
[52]

On the emergence of position bias in transformers

Xinyi Wu, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the emergence of position bias in transformers. In Forty-second International Conference on Machine Learn- ing, 2025. 1, 3

2025
[53]

Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy re- duction.arXiv preprint arXiv:2410.17247, 2024

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy re- duction.arXiv preprint arXiv:2410.17247, 2024. 1, 3, 6

work page arXiv 2024
[54]

Topv: Compatible token pruning with inference time optimization for fast and low-memory multi- modal vision language model

Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, and Bo Yuan. Topv: Compatible token pruning with inference time optimization for fast and low-memory multi- modal vision language model. InProceedings of the Com- puter Vision and Pattern Recognition Conference (CVPR), pages 19803–19813, 2025. 1

2025
[55]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19792–19802, 2025. 3, 6

2025
[56]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024. 1

work page internal anchor Pith review arXiv 2024
[57]

mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024. 1

work page arXiv 2024
[58]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 3

2023
[59]

Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference. InInternational Conference on Machine Learn- ing, 2025. 1, 3, 6

2025
[60]

Can clip count stars? an empirical study on quantity bias in clip

Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, and Chen- liang Xu. Treat visual tokens as text? but your mllm only needs fewer efforts to see.arXiv preprint arXiv:2410.06169,

work page arXiv
[61]

Mooscomp: Improving lightweight long-context compressor via mitigating over- smoothing and incorporating outlier scores.CoRR, abs/2504.16786, 2025

Fengwei Zhou, Jiafei Song, Wenjin Jason Li, Gengjian Xue, Zhikang Zhao, Yichao Lu, and Bailin Na. Mooscomp: Im- proving lightweight long-context compressor via mitigat- ing over-smoothing and incorporating outlier scores.arXiv preprint arXiv:2504.16786, 2025. 2, 5

work page arXiv 2025
[62]

autoregressive

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. InThe Twelfth International Conference on Learning Representa- tions, 2024. 1 EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labelin...

2024