Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

Borui Jiang; Dehua Zheng; Yulin Zhao; Yun Wang; Zheng Zhang

arxiv: 2605.20950 · v1 · pith:5H3RRBLAnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-21 05:17 UTC · model grok-4.3

keywords visual token reduction · vision-language models · subject-centric pruning · progressive reduction · focus-then-context · inference efficiency · VLM acceleration · token pruning

The pith

A subject-centric progressive reduction method cuts visual tokens in vision-language models by first locating key subjects and then preserving their surrounding context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SPpruner to address the high computational cost of long visual token sequences in VLMs by emulating human visual perception with a focus-then-context process. It builds a focus identification module that combines visual saliency with semantic relevance to find a broad set of subjects, then applies context-aware structural scanning to link neighboring regions and maintain overall structure. This differs from prior approaches that keep only query-matched subjects in isolation. If the method works as described, it delivers faster inference while retaining most of the original performance on models such as Qwen2.5-VL and LLaVA. The experiments report clear speed and efficiency gains against existing token reduction techniques.

Core claim

The central claim is that a subject-centric progressive visual token reduction paradigm, built around an initial focus identification module modeling the interplay of visual saliency and semantic relevance followed by a context-aware structural scanning module that aggregates neighboring cues, produces higher-fidelity subject representations and better-preserved global relational dependencies than methods limited to isolated query-aligned subjects.

What carries the argument

The SPpruner paradigm's focus identification module, which excavates the full visual subject spectrum, paired with its context-aware structural scanning module that restores relational dependencies.
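The two-stage idea can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes a small 2-D grid of per-token scores, takes the top-scoring tokens as focal tokens, then adds their 4-connected grid neighbours as context until a token budget is reached. The function name `focus_then_context`, the two budget parameters, and the neighbourhood rule are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

Coord = Tuple[int, int]


def focus_then_context(scores: List[List[float]], n_focal: int, n_keep: int) -> Set[Coord]:
    """Toy focus-then-context selection on a grid of token scores.

    Illustrative assumption only: focal tokens are the n_focal highest
    scores, and context tokens are their 4-connected grid neighbours,
    added in descending score order until n_keep tokens are kept.
    """
    rows, cols = len(scores), len(scores[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Stage 1 (focus): rank all tokens by score, keep the best n_focal.
    ranked = sorted(coords, key=lambda rc: scores[rc[0]][rc[1]], reverse=True)
    kept: Set[Coord] = set(ranked[:n_focal])

    # Stage 2 (context): collect neighbours of focal tokens not yet kept.
    neighbours: Set[Coord] = set()
    for r, c in kept:
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in kept:
                neighbours.add((nr, nc))
    # Add the highest-scoring neighbours until the budget is used up.
    for rc in sorted(neighbours, key=lambda rc: scores[rc[0]][rc[1]], reverse=True):
        if len(kept) >= n_keep:
            break
        kept.add(rc)
    return kept


if __name__ == "__main__":
    grid = [
        [0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.3, 0.0],
        [0.1, 0.3, 0.1, 0.8],
        [0.0, 0.0, 0.2, 0.4],
    ]
    print(sorted(focus_then_context(grid, n_focal=2, n_keep=5)))
```

In this toy grid the two peaks are kept first, and the remaining budget goes to the tokens bordering them rather than to the next-highest scores anywhere in the image, which is the contrast with plain Top-K selection that the paradigm relies on.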

If this is right

Achieves up to 2.53 times speedup while retaining only 22.2 percent of visual tokens on Qwen2.5-VL.
Delivers a 67 percent FLOPs reduction on LLaVA with a 0.6 percent accuracy drop.
Outperforms prior state-of-the-art vision token reduction methods across tested models and tasks.
Maintains structural integrity of preserved subjects by incorporating contextual cues from neighboring regions.
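As a rough reading of the first item above: keeping 22.2 percent of visual tokens would give at most about 4.5 times speedup if all inference cost scaled linearly with the visual token count, so a 2.53 times speedup suggests that part of the cost does not shrink with pruning. The sketch below uses an Amdahl-style linear cost model, which is an editorial assumption and not something the paper states (attention cost, for one, is not linear in sequence length). `speedup` and `implied_fraction` are hypothetical helper names.

```python
def speedup(p: float, r: float) -> float:
    """Amdahl-style speedup when a fraction p of the baseline cost scales
    linearly with the visual token count and only a fraction r of the
    visual tokens is kept. Assumed model, for illustration only."""
    return 1.0 / ((1.0 - p) + p * r)


def implied_fraction(s: float, r: float) -> float:
    """Invert the model: the fraction p that would yield speedup s at
    retention ratio r."""
    return (1.0 - 1.0 / s) / (1.0 - r)


if __name__ == "__main__":
    r = 0.222  # retention ratio reported for Qwen2.5-VL
    print(f"upper bound if all cost scaled with tokens: {speedup(1.0, r):.2f}x")
    print(f"fraction implied by a 2.53x speedup: {implied_fraction(2.53, r):.3f}")
```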

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The staged focus-then-context design could be tested on other multimodal models that handle image or video sequences.
Adaptive versions might vary the retained token percentage according to scene complexity.
The modules could be combined with quantization or pruning of the language component for further gains.

Load-bearing premise

The focus identification module can reliably detect the interplay of saliency and semantic relevance across all relevant subjects without missing context or skewing toward the query.

What would settle it

Running the reduced-token model on a set of images with multiple interacting subjects and measuring whether accuracy falls more than a few percent below the full-token baseline.
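One way to frame that check, as a minimal sketch: score the full-token and reduced-token predictions on the same multi-subject examples and compare accuracies in percentage points. The labels and predictions below are made-up placeholders, and `accuracy` and `accuracy_drop_points` are hypothetical helper names rather than part of any benchmark toolkit.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def accuracy_drop_points(full: Sequence[str], reduced: Sequence[str],
                         labels: Sequence[str]) -> float:
    """Accuracy lost by token reduction, in percentage points."""
    return 100.0 * (accuracy(full, labels) - accuracy(reduced, labels))


if __name__ == "__main__":
    # Placeholder data, not results from the paper.
    labels = ["boat", "bird", "mirror", "glass", "lime"]
    full_preds = ["boat", "bird", "mirror", "glass", "lime"]
    reduced_preds = ["boat", "bird", "sink", "glass", "lime"]
    print(f"drop: {accuracy_drop_points(full_preds, reduced_preds, labels):.1f} points")
```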

Figures

Figures reproduced from arXiv: 2605.20950 by Borui Jiang, Dehua Zheng, Yulin Zhao, Yun Wang, Zheng Zhang.

**Figure 1.** Comparison between existing paradigms and SPpruner. Query-centric paradigms tend to retain tokens strictly aligned with explicit text queries and inadvertently discard other salient subjects (e.g., mirror) that are critical for answering comprehensive questions. This loss of other salient subjects serves visual understanding. In contrast, SPpruner preserves a broad spectrum of visual subjects and their str…

**Figure 2.** Framework of SPpruner. (a) The focus identification module first identifies salient visual subjects by combining intrinsic visual saliency with semantic relevance to the text query. (b) The context-aware structural scanning module then employs a structure-responsive sampling mechanism to select contextual tokens associated with these identified subjects, ensuring structural integrity. (c) Construct the fin…

**Figure 3.** Ablation Studies. The performance drop without SRS confirms the necessity of adaptive retention strides, while the other metrics validate their role in saliency identification. By unifying these, SPpruner outperforms all variants to achieve 1.2×–1.5× speedups with comparable accuracy on chart and document understanding tasks. [Plot panel: MME-Perception; other panel titles not recoverable]

**Figure 4.** Ablation Studies. This figure shows that too few focal tokens impair holistic perception by omitting subjects, while too many reduce SRS to generic Top-K selection due to diminished contextual cues.

**Figure 5.** Visualization of token retention across increasing reduction ratios. Under a query inquiring about secondary objects (i.e., not the “Bird”), SPpruner excels in capturing a broad visual subject spectrum. Unlike baselines that discard unqueried subjects, our method successfully retains salient objects (e.g., boat) even at extreme reduction ratios.

**Figure 6.** Visualization of token retention across increasing reduction ratios. Under a query inquiring about secondary objects (i.e., not the “Glass”), SPpruner excels in capturing a broad visual subject spectrum. Unlike baselines that discard unqueried subjects, our method successfully retains salient objects (e.g., champagne) even at extreme reduction ratios.

**Figure 7.** Visualization of token retention across increasing reduction ratios. Under a query inquiring about secondary objects (i.e., not the “Lime”), SPpruner excels in capturing a broad visual subject spectrum. Unlike baselines that discard unqueried subjects, our method successfully retains salient objects (e.g., tequila) even at extreme reduction ratios.

Original abstract

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the Focus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a two-stage token pruning method for VLMs that first targets salient subjects via saliency-relevance modeling then adds context, with reported speedups that look usable if the numbers check out.

The main takeaway is a progressive reduction scheme called SPpruner that splits the job into focus identification followed by context-aware scanning. It tries to keep a broader set of visual subjects and their relations instead of just the query-matched bits that earlier pruning methods often lock onto. The approach is framed as copying how people look at a scene, taking in the main subjects first and then filling in the surroundings, and the abstract spells out the two modules clearly enough to see the difference from prior isolated-subject work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SPpruner, a subject-centric progressive visual token reduction paradigm for Vision-Language Models that emulates the Focus-then-Context mechanism of human visual perception. It introduces a focus identification module to model the interplay between visual saliency and semantic relevance for preserving comprehensive visual subjects, followed by a context-aware structural scanning module to aggregate contextual cues and restore global dependencies. Experiments on Qwen2.5-VL and LLaVA are reported to outperform SOTA methods, with up to 2.53 times speedup at 22.2% token retention on Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with only a 0.6% accuracy drop.

Significance. If the results hold under rigorous verification, this could represent a meaningful advance in efficient VLM inference by targeting the limitation of prior token-reduction techniques that retain only query-aligned subjects. The two-stage human-inspired design is conceptually coherent, and the concrete efficiency metrics (speedup, FLOPs, token retention) on two distinct models would be a useful contribution to the field if accompanied by reproducible code and full experimental protocols.

major comments (2)

[§3.2] §3.2 (Focus Identification Module): The claim that the module 'excavates the comprehensive visual subject spectrum' by explicitly modeling saliency-relevance interplay is load-bearing for the central novelty and performance claims, yet the description does not specify whether saliency is computed independently (e.g., via a query-agnostic detector) or through cross-attention to the text query. If the latter, the module risks systematic down-weighting of salient but query-misaligned regions, directly undermining the subject-centric advantage asserted in the abstract and the reported gains.
[Experiments] Experiments section, Table 2 (Qwen2.5-VL results): The headline results (2.53× speedup at 22.2% token retention on Qwen2.5-VL, and the 0.6% accuracy drop reported for LLaVA) are presented without error bars, number of runs, or explicit data-split details. This is load-bearing because the soundness assessment rests on these unexamined experimental details; without them the 'consistently outperforms SOTA' claim cannot be evaluated at the required level of rigor.

minor comments (2)

[Abstract] Abstract: The acronym SPpruner is introduced without expansion or definition on first use.
[§4.1] §4.1: Notation for the context-aware scanning module could be clarified by explicitly defining the aggregation function rather than describing it procedurally.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their thoughtful and constructive feedback. We address each major comment point by point below, providing clarifications and indicating revisions where the manuscript will be updated.

Referee: [§3.2] §3.2 (Focus Identification Module): The claim that the module 'excavates the comprehensive visual subject spectrum' by explicitly modeling saliency-relevance interplay is load-bearing for the central novelty and performance claims, yet the description does not specify whether saliency is computed independently (e.g., via a query-agnostic detector) or through cross-attention to the text query. If the latter, the module risks systematic down-weighting of salient but query-misaligned regions, directly undermining the subject-centric advantage asserted in the abstract and the reported gains.

Authors: We thank the referee for this important observation on clarity. In the Focus Identification Module, visual saliency is computed independently using a query-agnostic saliency detector (based on established CV techniques such as gradient-based or attention-map methods from a frozen backbone), while semantic relevance is modeled separately via cross-attention with the text query. The two signals are then fused to preserve the full subject spectrum. This separation explicitly avoids down-weighting salient but query-misaligned regions, directly supporting the subject-centric claim. We have revised §3.2 to explicitly state the independent saliency path, added pseudocode, and included a new diagram illustrating the two parallel streams and their fusion. revision: yes
Referee: [Experiments] Experiments section, Table 2 (Qwen2.5-VL results): The headline results (2.53× speedup at 22.2% token retention on Qwen2.5-VL, and the 0.6% accuracy drop reported for LLaVA) are presented without error bars, number of runs, or explicit data-split details. This is load-bearing because the soundness assessment rests on these unexamined experimental details; without them the 'consistently outperforms SOTA' claim cannot be evaluated at the required level of rigor.

Authors: We agree that greater transparency on experimental protocol is required. The reported metrics follow the standard fixed splits and evaluation protocols of the benchmarks (VQAv2, GQA, POPE, etc.). Due to the substantial compute required for full VLM inference, primary results reflect single runs per setting; however, we have now added error bars computed over three random seeds for the key Qwen2.5-VL configurations in a new supplementary table, explicitly documented the data splits and preprocessing, and expanded the experimental setup subsection. We will release code and full reproduction scripts upon acceptance to enable independent verification. revision: yes
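Error bars over a handful of seeds are typically reported as mean ± sample standard deviation. A minimal sketch with Python's standard `statistics` module follows; the three scores are invented placeholders, not values from the paper, and `summarize` is a hypothetical helper.

```python
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    seed_scores = [71.0, 72.0, 73.0]  # placeholder accuracies for 3 seeds
    mean, std = summarize(seed_scores)
    print(f"{mean:.1f} ± {std:.1f}")
```

With only three seeds the standard deviation is itself a noisy estimate, so such error bars are best read as a sanity check rather than a significance test.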

Circularity Check

0 steps flagged

No circularity; method defined procedurally via new modules

The paper proposes SPpruner as a subject-centric progressive reduction paradigm that emulates human visual perception through two explicitly constructed modules: a focus identification module modeling saliency-relevance interplay and a context-aware structural scanning module for relational dependencies. No equations, fitted parameters, or predictions are shown that reduce by construction to inputs or prior self-citations. Performance claims rest on experimental results rather than any self-referential derivation, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that human visual perception follows a reliable focus-then-context sequence and that saliency plus semantic relevance can be jointly modeled without additional learned parameters beyond those in the base VLM.

axioms (1)

domain assumption: The human visual system processes scenes via an initial focus on salient subjects followed by contextual integration.
Invoked to justify the two-module design in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance... $S(x_i) = \Phi(\lVert x_i \rVert_1) + \Phi(R(x_i \mid X_q))$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Structure-Responsive Sampling (SRS) mechanism... $\Delta = \max(1, \lfloor (N_{\text{target}} - |F|) \cdot \delta \rfloor)$
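The two quoted formulas can be made concrete with a short sketch. The passages above do not define Φ, R, or δ, so the code assumes that Φ is min-max normalization over the visual tokens, that R(x_i | X_q) is the maximum cosine similarity between a visual token and any query token, and that δ is a ratio in (0, 1]. Treating Δ as a sampling stride is likewise an assumption, based on the mention of retention strides in the Figure 3 caption. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def min_max(values: List[float]) -> List[float]:
    """Assumed form of Φ: rescale values to [0, 1] (all zeros if constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def focus_scores(visual: List[Vector], query: List[Vector]) -> List[float]:
    """S(x_i) = Φ(||x_i||_1) + Φ(R(x_i | X_q)) under the assumptions above."""
    l1 = [sum(abs(v) for v in x) for x in visual]
    rel = [max(cosine(x, q) for q in query) for x in visual]
    return [s + r for s, r in zip(min_max(l1), min_max(rel))]


def srs_stride(n_target: int, n_focal: int, delta: float) -> int:
    """Δ = max(1, floor((N_target - |F|) * δ))."""
    return max(1, math.floor((n_target - n_focal) * delta))


if __name__ == "__main__":
    visual = [[3.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    query = [[0.0, 1.0]]
    print(focus_scores(visual, query))
    print(srs_stride(n_target=128, n_focal=32, delta=0.25))
```

In this toy input the first token has a large L1 norm but no overlap with the query, yet under these assumptions it scores as high as the query-aligned second token. That is the behaviour an additive score would allow: a salient but unqueried subject is not pushed out by the relevance term.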

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 22 internal anchors

[1]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Divprune: Diversity-based visual token pruning for large multimodal models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[2]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Pact: Pruning and clustering-based token reduction for faster visual language models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[3]

highlighted tokens

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs , author=. arXiv preprint arXiv:2506.10967 , year=

work page arXiv
[4]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Visionzip: Longer is better but not necessary in vision language models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[5]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Sparsevlm: Visual token sparsification for efficient vision-language model inference , author=. arXiv preprint arXiv:2410.04417 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

arXiv preprint arXiv:2505.22654 , year=

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models , author=. arXiv preprint arXiv:2505.22654 , year=

work page arXiv
[7]

arXiv preprint arXiv:2502.11501 , year=

Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem? , author=. arXiv preprint arXiv:2502.11501 , year=

work page arXiv
[8]

Tokenskip: Controllable chain-of-thought compression in llms.arXiv preprint arXiv:2502.12067,

Tokenskip: Controllable chain-of-thought compression in llms , author=. arXiv preprint arXiv:2502.12067 , year=

work page arXiv
[9]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Token cropr: Faster vits for quite a few tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[10]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page
[11]

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=

What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph , author=. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=

work page
[12]

Visual Instruction Tuning , author=

work page
[13]

Llavanext: Improved reasoning, ocr, and world knowledge , author=

work page
[14]

Qwen Technical Report

Qwen technical report , author=. arXiv preprint arXiv:2309.16609 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

The Conference on Neural Information Processing Systems (NeurIPS) , year=

Attention is all you need , author=. The Conference on Neural Information Processing Systems (NeurIPS) , year=

work page
[19]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Stop looking for important tokens in multimodal language models: Duplication matters more

Stop looking for important tokens in multimodal language models: Duplication matters more , author=. arXiv preprint arXiv:2502.11494 , year=

work page arXiv
[21]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Token Merging: Your ViT But Faster

Token merging: Your vit but faster , author=. arXiv preprint arXiv:2210.09461 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

The European Conference on Computer Vision (ECCV) , year=

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models , author=. The European Conference on Computer Vision (ECCV) , year=

work page
[24]

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction , author=. arXiv preprint arXiv:2410.17247 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Llava-mini: Efficient image and video large mul- timodal models with one vision token,

Llava-mini: Efficient image and video large multimodal models with one vision token , author=. arXiv preprint arXiv:2501.03895 , year=

work page arXiv
[26]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Video-llava: Learning united visual representation by alignment before projection , author=. arXiv preprint arXiv:2311.10122 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[28]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[29]

The Conference on Neural Information Processing Systems (NeurIPS) , year=

Instructblip: Towards general-purpose vision-language models with instruction tuning , author=. The Conference on Neural Information Processing Systems (NeurIPS) , year=

work page
[30]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv preprint arXiv:2304.10592 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

International Conference on Machine Learning (ICML) , year=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International Conference on Machine Learning (ICML) , year=

work page
[32]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[33]

arXiv preprint arXiv:2508.06084 , year=

AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance , author=. arXiv preprint arXiv:2508.06084 , year=

work page arXiv
[34]

Transformer Feed-Forward Layers Are Key-Value Memories

Transformer feed-forward layers are key-value memories , author=. arXiv preprint arXiv:2012.14913 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2012
[35]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[36]

arXiv preprint arXiv:2505.13220 , year=

SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science , author=. arXiv preprint arXiv:2505.13220 , year=

work page arXiv
[37]

The European Conference on Computer Vision (ECCV) , year=

Mmbench: Is your multi-modal model an all-around player? , author=. The European Conference on Computer Vision (ECCV) , year=

work page
[38]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[39]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Mme: A comprehensive evaluation benchmark for multimodal large language models , author=. arXiv preprint arXiv:2306.13394 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Evaluating Object Hallucination in Large Vision-Language Models

Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page
[42]

Object Hallucination in Image Captioning

Object hallucination in image captioning , author=. arXiv preprint arXiv:1809.02156 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[44]

Journal of Comparative Neurology , volume=

Human photoreceptor topography , author=. Journal of Comparative Neurology , volume=. 1990 , publisher=

work page 1990
[45]

Journal of Vision , volume=

Visual search: A retrospective , author=. Journal of Vision , volume=. 2011 , publisher=

work page 2011
[46]

Exploring the reasoning abilities of multimodal large language models (mllms): A compre- hensive survey on emerging trends in multimodal reasoning

Yiqi Wang and Wentao Chen and Xiaotian Han and Xudong Lin and Haiteng Zhao and Yongfei Liu and Bohan Zhai and Jianbo Yuan and Quanzeng You and Hongxia Yang , title =. arXiv preprint arXiv:2401.06805 , year=

work page arXiv
[47]

Visual Question Answering: A Survey of Methods and Datasets

Qi Wu and Damien Teney and Peng Wang and Chunhua Shen and Anthony Dick and Anton van den Hengel , title=. arXiv preprint arXiv:1607.05910 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[48]

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li and Yuechen Zhang and Chengyao Wang and Zhisheng Zhong and Yixin Chen and Ruihang Chu and Shaoteng Liu and Jiaya Jia , title=. arXiv preprint arXiv:2403.18814 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[49]

Yolo Y. Tang and Jing Bi and Siting Xu and Luchuan Song and Susan Liang and Teng Wang and Daoan Zhang and Jie An and Jingyang Lin and Rongyi Zhu and Ali Vosoughi and Chao Huang and Zeliang Zhang and Pinxin Liu and Mingqian Feng and Feng Zheng and Jianguo Zhang and Ping Luo and Jiebo Luo and Chenliang Xu , title=. arXiv preprint arXiv:2312.17432 , year=

work page arXiv
[50]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li and Yuanhan Zhang and Dong Guo and Renrui Zhang and Feng Li and Hao Zhang and Kaichen Zhang and Peiyuan Zhang and Yanwei Li and Ziwei Liu and Chunyuan Li , title=. arXiv preprint arXiv:2408.03326 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

The Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

Evaluating Object Hallucination in Large Vision-Language Models , author=. The Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

work page
[52]

VQA: Visual Question Answering

Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh , title=. arXiv preprint arXiv:1505.00468 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

The Conference on Neural Information Processing Systems (NeurIPS) , year=

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering , author=. The Conference on Neural Information Processing Systems (NeurIPS) , year=

work page
[54]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Towards VQA Models That Can Read , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[55]

Manmatha and C

Minesh Mathew and Dimosthenis Karatzas and R. Manmatha and C. V. Jawahar , title=. arXiv preprint arXiv:2007.00398 , year=

work page arXiv 2007
[56]

Science China Information Sciences , volume=

Liu, Yuliang and Li, Zhang and Huang, Mingxin and Yang, Biao and Yu, Wenwen and Li, Chunyuan and Yin, Xu-Cheng and Liu, Cheng-Lin and Jin, Lianwen and Bai, Xiang , title=. Science China Information Sciences , volume=

work page
[57]

International Conference on Machine Learning (ICML) , year=

Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. International Conference on Machine Learning (ICML) , year=

work page
[58]

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding. arXiv preprint arXiv:2406.09411.

[59]

MLVU: Benchmarking multi-task long video understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]

Are We on the Right Way for Evaluating Large Vision-Language Models? arXiv preprint arXiv:2403.20330.

[61]

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics.

[62]

Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Findings of the Association for Computational Linguistics (ACL).

[63]

OpenImages: A public dataset for large-scale multi-label and multi-class image classification.

[64]

Boosting multimodal large language models with visual tokens withdrawal for rapid inference. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

[65]

Pangu Embedded: An efficient dual-system LLM reasoner with metacognition. arXiv preprint arXiv:2505.22375, 2025.

[66]

Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
