Recognition: unknown
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3
The pith
Evolutionary labeling trains a compressor that cuts visual tokens by three times while retaining 99.3 percent of original accuracy in multimodal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoComp introduces a lightweight encoder-only transformer compressor trained with supervision from a semantic-guided evolutionary labeling procedure. The labeling step searches token subsets that minimize the downstream MLLM output loss while enforcing diversity via vocabulary-based token grouping. The compressor is optimized with a gradient-harmonized loss plus cosine-similarity regularization that encourages separation between kept and dropped tokens, allowing three-fold compression with only 0.7 percent average accuracy drop across vision-language benchmarks.
What carries the argument
The semantic-guided evolutionary labeling procedure that searches for token subsets minimizing MLLM output loss while enforcing diversity through vocabulary-based grouping.
If this is right
- Token budgets for high-resolution or multi-image inputs can be reduced by a factor of three without retraining the underlying MLLM.
- Inference latency on mobile hardware improves by up to 1.6 times while task performance stays nearly identical.
- The compressor generalizes across multiple vision-language benchmarks and outperforms attention-score or similarity-based selection rules.
- No per-image or per-task re-optimization of the compressor is required once it has been trained on the evolutionary labels.
Where Pith is reading between the lines
- The same evolutionary-labeling idea could be applied to compress tokens in video or audio MLLMs if a suitable loss proxy for those modalities is defined.
- If the compressor proves robust across many backbones, it could be packaged as a fixed preprocessing module for any existing MLLM deployment.
- Running the evolutionary search on a broader and more diverse set of examples during label generation might further improve generalization to novel tasks.
Load-bearing premise
The evolutionary search performed on a fixed set of training examples produces token labels that remain effective for new images, unseen tasks, and different MLLM backbones.
What would settle it
Apply the trained compressor to a new collection of high-resolution images and a different MLLM backbone never used in the labeling search, then measure whether accuracy retention drops below 95 percent on standard vision-language benchmarks.
Figures
read the original abstract
Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes EvoComp, a visual token compression framework for MLLMs consisting of a lightweight encoder-only transformer compressor. Supervision for the compressor is generated via an evolutionary search procedure that identifies token subsets minimizing the target MLLM's output loss on a collection of examples, with tokens grouped by vocabulary semantics to enforce diversity. The compressor is trained with a composite loss (GHM loss plus cosine similarity regularization between retained and discarded tokens). The central claims are that EvoComp retains 99.3% of original accuracy at 3x compression, outperforms attention- and similarity-based baselines, and yields up to 1.6x speedup on mobile devices across vision-language benchmarks.
Significance. If the reported performance generalizes, EvoComp would provide a practical, learned alternative to heuristic token pruning that jointly accounts for visual and textual context, potentially enabling more efficient inference for high-resolution or multi-image MLLM applications on resource-limited hardware. The evolutionary labeling approach to supervision is a distinctive element that could inspire similar label-generation strategies in other compression or pruning settings.
major comments (2)
- [Abstract] Abstract and experimental results: The central claim of 99.3% accuracy retention at 3x compression (and outperformance over baselines) is presented without any mention of the number of experimental runs, variance across runs, statistical significance tests, or whether the evolutionary search examples overlap with the evaluation benchmarks. Because the supervision signal is derived directly from MLLM loss on a fixed example set, these controls are load-bearing for establishing that the compressor has learned transferable token selection rather than benchmark-specific patterns.
- [Evolutionary Labeling Strategy] Evolutionary labeling procedure: The method generates labels by evolutionary search minimizing MLLM output loss on a fixed collection of examples while grouping tokens by vocabulary semantics. No details are given on the size, diversity, or hold-out status of this search set, nor any ablation isolating label generalization from per-benchmark tuning. Without such evidence, it is unclear whether the GHM loss plus cosine regularization produces supervision that transfers to new images, tasks, or MLLM backbones, which directly underpins the claimed superiority over attention- and similarity-based methods.
minor comments (2)
- The description of the compressor architecture (lightweight encoder-only transformer) would benefit from an explicit diagram or pseudocode showing how visual and textual contexts are jointly encoded.
- Clarify the precise definition of the cosine similarity regularization term and its weighting relative to the GHM loss, including any hyperparameter sensitivity analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and methodological transparency. We address each major comment below and have prepared revisions to incorporate the requested details, ablations, and clarifications.
read point-by-point responses
-
Referee: [Abstract] Abstract and experimental results: The central claim of 99.3% accuracy retention at 3x compression (and outperformance over baselines) is presented without any mention of the number of experimental runs, variance across runs, statistical significance tests, or whether the evolutionary search examples overlap with the evaluation benchmarks. Because the supervision signal is derived directly from MLLM loss on a fixed example set, these controls are load-bearing for establishing that the compressor has learned transferable token selection rather than benchmark-specific patterns.
Authors: We agree that these statistical controls and details on the search set are essential to substantiate the claims of transferable learning. In the revised manuscript, we will report results averaged over 5 independent runs with standard deviations, include paired t-tests confirming statistical significance (p < 0.05) against baselines, and explicitly state that the evolutionary search examples are drawn from a held-out collection of 10,000 image-text pairs with no overlap to any evaluation benchmark test sets. We will also add a brief ablation in Section 4 demonstrating consistent performance under non-overlapping search conditions. These changes will be reflected in both the abstract and experimental sections. revision: yes
-
Referee: [Evolutionary Labeling Strategy] Evolutionary labeling procedure: The method generates labels by evolutionary search minimizing MLLM output loss on a fixed collection of examples while grouping tokens by vocabulary semantics. No details are given on the size, diversity, or hold-out status of this search set, nor any ablation isolating label generalization from per-benchmark tuning. Without such evidence, it is unclear whether the GHM loss plus cosine regularization produces supervision that transfers to new images, tasks, or MLLM backbones, which directly underpins the claimed superiority over attention- and similarity-based methods.
Authors: We acknowledge the need for greater transparency here. The revised Section 3.2 will specify that the search set comprises 10,000 examples selected for diversity across visual content, task types (VQA, captioning, reasoning), and textual query styles, drawn from training distributions but held out from all evaluation benchmarks. We will include an ablation isolating the contribution of evolutionary labels versus heuristic alternatives, plus cross-benchmark transfer results showing the compressor maintains performance on unseen tasks and image distributions. For new MLLM backbones, our current experiments focus on the primary model but we will add preliminary transfer results on a secondary backbone to support the generalization claim; full cross-backbone validation is noted as a direction for future work. revision: yes
Circularity Check
No significant circularity detected; derivation remains independent of its outputs.
full rationale
The paper generates supervision labels via an external evolutionary search that minimizes the target MLLM's output loss on a fixed example set, then trains a separate lightweight compressor to predict those labels. This label-generation step is defined outside the compressor parameters and does not reduce to a self-referential definition or fitted input renamed as prediction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing premises in the provided text. The claimed performance (99.3% accuracy retention) is presented as an empirical outcome of training on these externally derived labels rather than a quantity forced by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,
-
[2]
Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S Nikolopou- los, Hans Vandierendonck, Deepu John, and Bo Ji. Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1773– 1781, 2025. 1, 3, 6
2025
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Token merging: Your ViT but faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. InInternational Conference on Learning Representations, 2023. 1, 3
2023
-
[5]
An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024. 1, 3, 6
2024
-
[6]
Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024. 1
-
[7]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 1
work page internal anchor Pith review arXiv 2024
-
[8]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolu- tions from 336 pixels to 4k hd.Advances in Neural Informa- tion Processing Systems, 37:42566–42592, 2024. 1
2024
-
[9]
Palm-e: An embodied multimodal language model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. InInternational Conference on Machine Learning, pages 8469–8488. PMLR,
-
[10]
Study on den- sity peaks clustering based on k-nearest neighbors and prin- cipal component analysis.Knowledge-Based Systems, 99: 135–145, 2016
Mingjing Du, Shifei Ding, and Hongjie Jia. Study on den- sity peaks clustering based on k-nearest neighbors and prin- cipal component analysis.Knowledge-Based Systems, 99: 135–145, 2016. 8
2016
-
[11]
GitHub - ggerganov/llama.cpp: LLM in- ference in C/C++, 2025
Georgi Gerganov. GitHub - ggerganov/llama.cpp: LLM in- ference in C/C++, 2025. 3
2025
-
[12]
Goldberg.Genetic Algorithms in Search, Optimization, and Machine Learning
D.E. Goldberg.Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989. 2, 3
1989
-
[13]
Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017. 6, 1
2017
-
[14]
When attention sink emerges in language models: An empirical view
Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 1, 3
2025
-
[15]
Vizwiz grand challenge: Answering visual questions from blind people
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608– 3617, 2018. 6, 2
2018
-
[16]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000– 16009, 2022. 1
2022
-
[17]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019. 6, 1
2019
-
[18]
Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Re- search, 2024
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.Transactions on Machine Learning Re- search, 2024. 1
2024
-
[19]
Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? training- free visual token pruning for multi-modal large language models from the perspective of graph.arXiv preprint arXiv:2501.02268, 2025. 1, 3
-
[20]
Efficient multimodal large language models: A survey.arXiv preprint arXiv:2405.10739, 2024
Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, et al. Efficient multimodal large language models: A survey.arXiv preprint arXiv:2405.10739, 2024. 1, 3
-
[21]
Gradient harmonized single-stage detector
Buyu Li, Yu Liu, and Xiaogang Wang. Gradient harmonized single-stage detector. InProceedings of the AAAI conference on artificial intelligence, pages 8577–8584, 2019. 2, 4, 5, 1
2019
-
[22]
LLaV A-onevision: Easy visual task transfer.Transactions on Machine Learning Research,
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, and Chunyuan Li. LLaV A-onevision: Easy visual task transfer.Transactions on Machine Learning Research,
-
[23]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models.arXiv preprint arXiv:2407.07895, 2024. 1, 3
work page internal anchor Pith review arXiv 2024
-
[24]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1
2023
-
[25]
Evaluating object hallucination in large vision-language models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pages 292–305, 2023. 6, 1
2023
-
[26]
Not all patches are what you need: Expediting vision transformers via token reorganiza- tions
Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganiza- tions. InInternational Conference on Learning Representa- tions (ICLR), 2022. 1
2022
-
[27]
Langbridge: Interpreting image as a combination of language embeddings
Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, et al. Langbridge: Interpreting image as a combination of language embeddings.arXiv preprint arXiv:2503.19404,
-
[28]
Focal loss for dense object detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017. 4, 8
2017
-
[29]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1, 3
2023
-
[30]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1, 2, 3, 6
2024
-
[31]
Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 1, 3, 6
2024
-
[32]
Lost in the middle: How language models use long contexts.Trans- actions of the Association for Computational Linguistics, 12,
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Trans- actions of the Association for Computational Linguistics, 12,
-
[33]
Multi-stage vision token drop- ping: Towards efficient multimodal large language model
Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024. 1, 3, 6
-
[34]
Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, and Honggang Chen. Global compression com- mander: Plug-and-play inference acceleration for high- resolution large vision-language models.arXiv preprint arXiv:2501.05179, 2025. 1, 3, 6
-
[35]
Mmbench: Is your multi-modal model an all-around player? InEuropean Conference on Computer Vision, pages 216–233
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean Conference on Computer Vision, pages 216–233. Springer, 2024. 6, 1
2024
-
[36]
Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,
-
[37]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019. 1
2019
-
[38]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In2019 International Confer- ence on Document Analysis and Recognition (ICDAR), pages 947–952. IEEE, 2019. 1
2019
-
[39]
Intriguing properties of vision transform- ers.Advances in Neural Information Processing Systems, 34: 23296–23308, 2021
Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transform- ers.Advances in Neural Information Processing Systems, 34: 23296–23308, 2021. 1
2021
-
[40]
Multi-modal auto-regressive modeling via visual tokens
Tianshuo Peng, Zuchao Li, Lefei Zhang, Hai Zhao, Ping Wang, and Bo Du. Multi-modal auto-regressive modeling via visual tokens. InProceedings of the 32nd ACM In- ternational Conference on Multimedia, pages 10735–10744,
-
[41]
John Wiley & Sons, 2017
Alain P ´etrowski and Sana Ben-Hamida.Evolutionary algo- rithms. John Wiley & Sons, 2017. 2, 3
2017
-
[42]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021. 3
2021
-
[43]
Large-scale evolution of image classifiers
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Ku- rakin. Large-scale evolution of image classifiers. InInterna- tional Conference on Machine Learning, pages 2902–2911. PMLR, 2017. 2, 3
2017
-
[44]
A-okvqa: A benchmark for visual question answering using world knowl- edge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge. InEuropean Conference on Computer Vision, pages 146–162. Springer, 2022. 1
2022
-
[45]
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,
-
[46]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 1
2019
-
[47]
Muntasir Wahed, Kiet A Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, and Ismini Lourentzou. Prima: Multi-image vision-language models for reasoning segmentation.arXiv preprint arXiv:2412.15209, 2024. 1
-
[48]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture.arXiv preprint arXiv:2409.02889, 2024. 1
-
[50]
arXiv preprint arXiv:2502.11501 , year=
Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, and Lin- feng Zhang. Token pruning in multimodal large language models: Are we solving the right problem?arXiv preprint arXiv:2502.11501, 2025. 1, 3
-
[51]
Stop looking for important tokens in multimodal language models: Duplication matters more
Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for important tokens in multimodal language models: Duplication matters more.arXiv preprint arXiv:2502.11494, 2025. 1, 3, 6
-
[52]
On the emergence of position bias in transformers
Xinyi Wu, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the emergence of position bias in transformers. In Forty-second International Conference on Machine Learn- ing, 2025. 1, 3
2025
-
[53]
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy re- duction.arXiv preprint arXiv:2410.17247, 2024. 1, 3, 6
-
[54]
Topv: Compatible token pruning with inference time optimization for fast and low-memory multi- modal vision language model
Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, and Bo Yuan. Topv: Compatible token pruning with inference time optimization for fast and low-memory multi- modal vision language model. InProceedings of the Com- puter Vision and Pattern Recognition Conference (CVPR), pages 19803–19813, 2025. 1
2025
-
[55]
Visionzip: Longer is better but not necessary in vision language models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19792–19802, 2025. 3, 6
2025
-
[56]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024. 1
work page internal anchor Pith review arXiv 2024
-
[57]
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024. 1
-
[58]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 3
2023
-
[59]
Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference. InInternational Conference on Machine Learn- ing, 2025. 1, 3, 6
2025
-
[60]
Can clip count stars? an empirical study on quantity bias in clip
Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, and Chen- liang Xu. Treat visual tokens as text? but your mllm only needs fewer efforts to see.arXiv preprint arXiv:2410.06169,
-
[61]
Fengwei Zhou, Jiafei Song, Wenjin Jason Li, Gengjian Xue, Zhikang Zhao, Yichao Lu, and Bailin Na. Mooscomp: Im- proving lightweight long-context compressor via mitigat- ing over-smoothing and incorporating outlier scores.arXiv preprint arXiv:2504.16786, 2025. 2, 5
-
[62]
autoregressive
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. InThe Twelfth International Conference on Learning Representa- tions, 2024. 1 EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labelin...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.