Recognition: 2 theorem links · Lean Theorem
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
Pith reviewed 2026-05-12 02:25 UTC · model grok-4.3
The pith
COAST prunes 77.8 percent of visual tokens in vision-language models, yielding 2.15x faster inference while retaining 98.64 percent of average performance, by using adaptive contrastive routing instead of early-layer attention scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that scalar attention pruning is unreliable for compositional vision-language tasks because tokens that receive low scores early can become critical later for resolving spatial relations and contextual cues. COAST addresses this by casting compression as adaptive semantic routing: it identifies query-specific anchors using native cross-modal attention, estimates contextual dispersion with attention entropy, adapts the retention balance between evidence and context, and employs a contrastive routing score to retain both anchor-aligned tokens and complementary spatial information. This prevents the model from losing visual grounding. On seven benchmarks the method reduces visual tokens by 77.8 percent, achieves a 2.15x latency speedup, and retains 98.64 percent of the original average performance.
What carries the argument
COAST's contrastive adaptive semantic routing, which identifies query-specific anchors from cross-modal attention, quantifies dispersion via attention entropy, and applies a contrastive score to preserve both primary evidence and surrounding spatial context.
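To make the routing concrete, here is a minimal NumPy sketch of the pipeline as described above: query-specific anchors from cross-modal attention, a normalized attention-entropy estimate of dispersion, an entropy-driven split of the retention budget, and a contrastive score whose top-ranked tokens are kept as evidence and bottom-ranked tokens as complementary context. The function name, keep ratio, anchor count, and α range are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def coast_route(attn_txt2img, img_feats, keep_ratio=0.222,
                alpha_min=0.1, alpha_max=0.5, n_anchors=8):
    """Sketch of COAST-style adaptive semantic routing over visual tokens.

    attn_txt2img: (Nv,) cross-modal attention mass each visual token receives
                  from the query text (assumed already averaged over heads).
    img_feats:    (Nv, d) visual token embeddings.
    Returns indices of the retained visual tokens.
    """
    Nv = attn_txt2img.shape[0]
    p = attn_txt2img / attn_txt2img.sum()

    # 1. Query-specific anchors: tokens with the highest cross-modal attention.
    anchor_idx = np.argsort(p)[-n_anchors:]

    # 2. Contextual dispersion via normalized attention entropy in [0, 1].
    H = -(p * np.log(p + 1e-12)).sum() / np.log(Nv)

    # 3. Adaptive budget split: higher entropy -> keep more complementary context.
    k_total = max(1, int(round(keep_ratio * Nv)))
    k_ctx = int(np.floor(k_total * (alpha_min + (alpha_max - alpha_min) * H)))
    k_evidence = max(1, k_total - k_ctx)

    # 4. Contrastive routing score: similarity to anchors minus similarity to
    #    the remaining tokens, so high scores mark evidence and low scores
    #    mark complementary spatial context.
    feats = img_feats / (np.linalg.norm(img_feats, axis=1, keepdims=True) + 1e-12)
    rest_idx = np.setdiff1d(np.arange(Nv), anchor_idx)
    sim_anchor = feats @ feats[anchor_idx].T          # (Nv, n_anchors)
    sim_rest = feats @ feats[rest_idx].T              # (Nv, Nv - n_anchors)
    score = sim_anchor.mean(axis=1) - sim_rest.mean(axis=1)

    # 5. Retain top-scoring tokens as evidence, bottom-scoring tokens as context.
    order = np.argsort(score)
    keep = np.concatenate([order[-k_evidence:], order[:k_ctx]])
    return np.unique(keep)

# Toy usage: 576 visual tokens (24x24 patches), random features and attention.
rng = np.random.default_rng(0)
idx = coast_route(rng.random(576), rng.normal(size=(576, 64)))
print(f"kept {idx.size} of 576 tokens")
```

The design point this sketch tries to capture is that the context budget grows with attention entropy, so diffuse queries keep more spatial context while focused queries spend the budget on anchor-aligned evidence.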
If this is right
- Vision-language models can achieve over 2x inference speedup on standard hardware while handling the same range of compositional and relational questions.
- The same pruning logic applies across different token budgets and multiple model families without any retraining.
- Inference becomes more reliable on tasks that require tracking multiple objects and their spatial layout rather than relying on text priors.
- Computational cost drops enough to support longer contexts or higher-resolution images under fixed hardware limits.
- Pruning decisions become query-dependent, so simple images use fewer tokens than complex ones.
Where Pith is reading between the lines
- Token importance in multimodal models is not fixed at early layers but shifts with the specific question being asked.
- Similar contrastive routing ideas could extend pruning to video or audio tokens where context also evolves over time.
- Real-time applications such as mobile visual question answering or robotics could adopt the method to stay responsive while keeping image grounding.
- The distinction between anchor evidence and complementary context may help diagnose other failure modes where models ignore parts of their input.
Load-bearing premise
The contrastive routing score derived from attention patterns can consistently select the visual tokens needed for any query's reasoning without missing key information or needing task-specific changes.
What would settle it
Apply COAST to a benchmark of complex spatial puzzles or scenes with rare secondary objects and check whether average performance falls more than two percent below the unpruned baseline or whether specific cases of lost visual grounding appear.
Original abstract
Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that scalar attention-based pruning of visual tokens in vision-language models induces a failure mode called Visual Aphasia, where early low-attention tokens become critical later for compositional reasoning involving secondary objects and spatial relations. It introduces COAST, a training-free adaptive semantic token pruning framework that uses native cross-modal attention to identify query-specific anchors, attention entropy for contextual dispersion, and a contrastive routing score to balance semantic evidence with spatial context. The method is reported to reduce visual tokens by 77.8%, achieve 2.15x latency speedup, and retain 98.64% of original average performance across seven benchmarks while outperforming strong baselines and generalizing across LVLM families.
Significance. If the empirical claims hold, the work could meaningfully advance efficient inference for large vision-language models by demonstrating that adaptive, contrastive routing can mitigate a plausible limitation of one-shot scalar pruning without requiring training or task-specific tuning. The training-free design and reported cross-model generalization are notable strengths that could support practical adoption.
major comments (2)
- §3 (method): The contrastive routing score is the load-bearing component for the claim that COAST preserves both anchor-aligned evidence and complementary context without new failure modes; however, it is described only procedurally, with no equations, parameter analysis, or ablations against alternatives, leaving the reliability of the 98.64% retention unverified.
- §4 (experiments): The central performance numbers (77.8% token reduction, 2.15x speedup, 98.64% retention) and the reported outperformance over baselines are presented without details on experimental setup, statistical tests, or verification on compositional failure cases, which directly affects whether the results support the superiority of adaptive routing over scalar pruning.
minor comments (2)
- The term 'Visual Aphasia' is introduced as a novel failure mode but lacks a precise operational definition or illustrative examples in the early sections to distinguish it from related multimodal grounding issues.
- [Abstract] The abstract states generalization across multiple LVLM families but does not list the specific backbones or token budgets tested beyond the primary setting.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve clarity and rigor in the manuscript.
Point-by-point responses
Referee, §3 (method): The contrastive routing score is the load-bearing component for the claim that COAST preserves both anchor-aligned evidence and complementary context without new failure modes; however, it is described only procedurally, with no equations, parameter analysis, or ablations against alternatives, leaving the reliability of the 98.64% retention unverified.
Authors: We agree that the current description of the contrastive routing score in §3 is primarily procedural and lacks explicit mathematical formulation. In the revised manuscript, we will add the full equations defining the contrastive routing score (combining anchor alignment, attention entropy, and spatial context terms), include a parameter sensitivity analysis, and provide ablations comparing it against alternative routing strategies such as simple weighted sums or attention-only baselines. These additions will directly support the reliability of the reported 98.64% performance retention. revision: yes
Referee, §4 (experiments): The central performance numbers (77.8% token reduction, 2.15x speedup, 98.64% retention) and the reported outperformance over baselines are presented without details on experimental setup, statistical tests, or verification on compositional failure cases, which directly affects whether the results support the superiority of adaptive routing over scalar pruning.
Authors: We acknowledge the need for greater transparency in §4. The revised version will expand the experimental setup description to include full hyperparameter details, hardware specifications, and exact token budget configurations. We will add statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the performance comparisons and include targeted evaluations on compositional reasoning subsets (e.g., spatial relations and secondary object queries) to verify mitigation of Visual Aphasia relative to scalar pruning baselines. revision: yes
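As a rough illustration of the paired tests proposed here, the following sketch runs a paired t-test and a Wilcoxon signed-rank test on synthetic per-item outcomes; the sample size, accuracy level, and disagreement rate are invented for illustration and are not the paper's data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-example correctness (1 = right, 0 = wrong) for the same
# evaluation items under the unpruned model and under COAST pruning.
rng = np.random.default_rng(1)
full = rng.binomial(1, 0.62, size=500)
flip = rng.binomial(1, 0.03, size=500)       # small, invented disagreement rate
pruned = np.where(flip == 1, 1 - full, full)

# Paired tests on matched items, as the rebuttal proposes.
t_stat, t_p = stats.ttest_rel(full, pruned)
w_stat, w_p = stats.wilcoxon(full, pruned, zero_method="zsplit")

retention = 100.0 * pruned.mean() / full.mean()
print(f"retention {retention:.2f}%  paired t p={t_p:.3f}  Wilcoxon p={w_p:.3f}")
```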
Circularity Check
No significant circularity: procedural framework relies on native model attention without self-referential derivations or fitted loops.
Full rationale
The paper describes COAST as a training-free method that directly uses the LVLM's existing cross-modal attention and attention entropy to compute a contrastive routing score for token retention. No equations, parameter fittings, or derivations are presented that reduce by construction to the method's own outputs or to self-citations. The performance claims (77.8% token reduction, 98.64% retained accuracy) are supported by empirical results across benchmarks rather than any mathematical equivalence to inputs. The observation that no equations or derivations are presented is consistent with the provided text, confirming that the derivation chain is self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: native cross-modal attention can reliably identify query-specific anchors and estimate contextual dispersion via attention entropy.
invented entities (1)
- Visual Aphasia (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Quoted passage: "COAST uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context... Score(c_i) = Sim_A(c_i) - Sim_R(c_i). COAST retains top-n_1 candidates as semantic evidence and bottom-n_2 candidates as complementary spatial context."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Quoted passage: "Entropy-driven dynamic budgeting... H(l) = -(1/log N_v) * sum_j p_j log p_j ... n_2 = floor(K_rest (α_min + (α_max - α_min) H))." Both quoted formulas are written out cleanly after this list.
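For readability, the two formulas quoted in the passages above can be written out as follows, with symbols inferred from the quotes (N_v visual tokens, attention weights p_j, and a remaining budget K_rest after evidence selection):

```latex
% Contrastive routing score for a candidate visual token c_i:
\[
  \mathrm{Score}(c_i) = \mathrm{Sim}_A(c_i) - \mathrm{Sim}_R(c_i)
\]
% Normalized attention entropy at layer l and the entropy-driven context budget n_2:
\[
  H(l) = -\frac{1}{\log N_v} \sum_{j} p_j \log p_j,
  \qquad
  n_2 = \left\lfloor K_{\mathrm{rest}} \left( \alpha_{\min} + (\alpha_{\max} - \alpha_{\min}) H \right) \right\rfloor
\]
```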
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.