pith. machine review for the scientific record.

arxiv: 2604.06912 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords Q-Zoom · multimodal large language models · adaptive perception · high-resolution processing · region of interest · dynamic gating · efficient inference · query-aware

The pith

Q-Zoom lets multimodal LLMs process only query-relevant high-resolution image regions to speed up inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models face efficiency limits because they always feed full high-resolution images into quadratic attention, creating redundant tokens even for queries that need detail only in small areas. Q-Zoom counters this with a coarse-to-fine pipeline: a lightweight gating network skips high-resolution entirely when global low-res features are enough, while a self-distilled proposal network finds the exact task-relevant region when fine detail matters. Training uses consistency labels for the gate and fully self-supervised distillation for the proposer, followed by alignment to merge local and global features. On Qwen2.5-VL-7B the method delivers 2.52 times faster inference on document and OCR tasks and 4.39 times faster in high-resolution settings while matching or beating baseline accuracy, and the gains carry over to other model families.
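
As a rough illustration, here is a minimal sketch of that coarse-to-fine routing at inference time. The module names and interfaces (`gate`, `roi_proposer`, `encode_image`, `fuse`) are hypothetical stand-ins, not the paper's actual API.

```python
# Minimal sketch of query-aware coarse-to-fine routing (hypothetical interfaces).
def answer_query(model, gate, roi_proposer, image, query):
    # 1. Cheap pass: encode a low-resolution view of the whole image.
    coarse = model.encode_image(image, max_tokens=256)

    # 2. The lightweight gate decides whether coarse features suffice.
    if gate(coarse, query) < 0.5:            # query needs no fine detail
        return model.generate(coarse, query)

    # 3. Otherwise, localize the query-relevant region from the same
    #    intermediate features and re-encode only that crop densely.
    box = roi_proposer(coarse, query)        # (x0, y0, x1, y1), normalized
    dense = model.encode_image(image.crop(box), max_tokens=4096)  # PIL-style crop

    # 4. Fuse dense local tokens with the coarse global layout and decode.
    return model.generate(model.fuse(coarse, dense, box), query)
```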

Core claim

Q-Zoom is a query-aware adaptive high-resolution perception framework for MLLMs that operates in an efficient coarse-to-fine manner. A lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. For queries that need fine-grained perception, a Self-Distilled Region Proposal Network precisely localizes the task-relevant Region-of-Interest directly from intermediate feature spaces. These modules are optimized with a consistency-aware generation strategy for routing labels and a fully self-supervised distillation paradigm for the proposer. A continuous spatio-temporal alignment scheme and targeted fine-tuning then fuse the dense local RoI with the coarse global layout.
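
One plausible reading of the consistency-aware routing-label generation, sketched below: a query is labeled "bypass high-resolution" only when the frozen model's low-resolution answer agrees with its high-resolution answer and the reference. This interpretation, and the helper names, are assumptions rather than the paper's exact recipe.

```python
def normalize(text):
    """Crude answer normalization for string comparison."""
    return " ".join(text.lower().split())

def routing_label(model, image, query, reference):
    """1 = route to the SD-RPN / high-res path, 0 = coarse features suffice (assumed rule)."""
    low = model.answer(image, query, max_tokens=256)     # coarse-only pass
    high = model.answer(image, query, max_tokens=4096)   # full high-res pass

    consistent = normalize(low) == normalize(high) == normalize(reference)
    return 0 if consistent else 1
```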

What carries the argument

The Dynamic Gating Network, which decides when high-resolution processing is needed, plus the Self-Distilled Region Proposal Network, which localizes query-relevant RoIs; both are trained without external labels, and their outputs are fused via spatio-temporal alignment.
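
A minimal sketch of the spatio-temporal alignment idea, under the assumption that each dense RoI token on an H'×W' local grid is assigned a position in the original global coordinate frame (as Figure 6 describes); the linear mapping below is an illustrative guess, not the paper's exact formulation.

```python
def align_roi_positions(box, h_local, w_local):
    """Map each local RoI token (i, j) to a position in normalized global coordinates.

    box: (x0, y0, x1, y1) of the RoI in the full image, normalized to [0, 1].
    """
    x0, y0, x1, y1 = box
    positions = []
    for i in range(h_local):                     # i in {0, ..., H'-1}
        for j in range(w_local):                 # j in {0, ..., W'-1}
            gy = y0 + (y1 - y0) * (i + 0.5) / h_local
            gx = x0 + (x1 - x0) * (j + 0.5) / w_local
            positions.append((gy, gx))           # token stays grounded in the full image
    return positions
```

Keeping the crop's tokens indexed in global coordinates is what lets the LLM treat the dense RoI as part of the same image as the coarse layout rather than as a detached picture.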

If this is right

  • Delivers 2.52 times faster inference on Document and OCR benchmarks while matching the baseline peak accuracy.
  • Achieves 4.39 times speedup in high-resolution scenarios with matching accuracy.
  • Can exceed baseline accuracy by 1.1 percent on document tasks and 8.1 percent on high-resolution tasks when configured for maximum fidelity.
  • The same speed and accuracy benefits transfer directly to Qwen3-VL, LLaVA, and RL-based thinking-with-image models without architecture changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same query-driven gating idea could be tested on video or 3D inputs where only certain frames or viewpoints need high detail.
  • Self-supervised distillation of the region proposer may allow similar adaptive modules to be added to existing models with minimal new labeled data.
  • Combining Q-Zoom style routing with token compression techniques might further reduce memory use for very large images or long contexts.

Load-bearing premise

The gating and proposal networks can correctly decide when high resolution is needed and accurately locate the relevant image region, without adding meaningful overhead or making localization mistakes that would hurt final performance.

What would settle it

A controlled test on a fine-grained document query where the gating network routes to low-resolution only, producing a clear accuracy drop relative to the always-high-resolution baseline on the same model.
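
A sketch of that settling experiment, assuming a simple exact-match metric and placeholder model and dataset interfaces:

```python
def gate_ablation(model_low_only, model_high_res, dataset):
    """Compare forced low-resolution routing against the always-high-res baseline."""
    hits_low, hits_high = 0, 0
    for image, query, reference in dataset:          # fine-grained document queries
        pred_low = model_low_only(image, query)      # gate forced to the coarse path
        pred_high = model_high_res(image, query)     # brute-force high-res baseline
        hits_low += int(pred_low.strip().lower() == reference.strip().lower())
        hits_high += int(pred_high.strip().lower() == reference.strip().lower())
    n = len(dataset)
    return hits_low / n, hits_high / n               # a clear gap would settle it
```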

Figures

Figures reproduced from arXiv: 2604.06912 by Chang Xu, Linfeng Wen, Minjing Dong, Xiaohuan Pei, Yuheng Shi.

Figure 1. Comparison of adaptive high-resolution perception paradigms. Training-free methods rely on handcrafted contrastive rules, requiring multiple redundant prefilling passes. RL-based methods use the LLM to auto-regressively generate code or coordinates to find the RoI. Our Q-Zoom framework operates directly on the intermediate feature space during a single prefilling pass, yielding superior efficiency. …
Figure 2. Overview of the proposed Adaptive High-Resolution Perception Framework. (a) The framework derives robust supervisory signals through consistency-aware generation to train a lightweight gating module. (b) During inference, the gate dynamically evaluates the textual query. It routes simpler queries for direct, accelerated generation using coarse features, while triggering the SD-RPN for complex queries…
Figure 3. Overview of the conditional Region-of-Interest extraction pipeline. When triggered by the dynamic gating module, the SD-RPN (top) leverages shared intermediate features from the frozen backbone to efficiently generate a dense spatial heatmap. During the training phase (bottom), the network is optimized through a self-distillation paradigm, utilizing denoised cross-modal attention maps from the base MLLM…
Figure 4. Overview of our pseudo-label generation pipeline. Raw attention maps from the MLLM are denoised by removing sink tokens, followed by a tri-state label assignment that isolates high-confidence foreground (FG) and background (BG) tokens while ignoring ambiguous intermediate regions. Layer index is omitted for brevity. (The tri-state assignment is sketched in code after the figure list.)
Figure 6. Overview of the Spatio-Temporal Alignment and Targeted Post-SFT pipeline. The vision encoder and projector are omitted for visual brevity.
Figure 7. Qualitative comparisons on challenging examples from TextVQA (left) and V* Bench (right). These examples highlight visually demanding scenarios where the target evidence is small or obscured. The baseline Qwen2.5-VL-7B suffers from resolution compression, whereas our Q-Zoom framework successfully leverages the SD-RPN to predict highly accurate RoI heatmaps, cropping the necessary fine-grained details…
Figure 8. Accuracy vs. Efficiency Trade-offs. We evaluate the Qwen2.5-VL-7B baseline against Q-Zoom by sweeping the maximum visual token limit in an image from 256 to 4,096. Our framework establishes a dominant Pareto frontier on both (a) Document & OCR and (b) High-Resolution benchmark categories. By adaptively localizing RoIs, Q-Zoom surpasses the peak accuracy of the brute-force 4,096-token baseline while reducing…
Figure 9. Ablation of the Consistency-aware Training Sample Generation upon Qwen2.5-VL 7B. (Top) Training loss curves of the dynamic gating network. (Bottom) The Pareto front illustrating the trade-off between perception accuracy and inference efficiency (No-RoI Ratio).
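
A minimal sketch of the tri-state pseudo-label assignment described in Figure 4, assuming the denoising simply zeroes out attention-sink tokens and that the foreground and background thresholds (τ_fg, τ_bg; values below are placeholders) are free hyperparameters:

```python
import numpy as np

def tri_state_labels(attn, sink_mask, tau_fg=0.6, tau_bg=0.2):
    """attn: (H, W) cross-modal attention map; sink_mask: (H, W) bool marking sink tokens."""
    attn = np.where(sink_mask, 0.0, attn)        # denoise: drop attention sinks
    attn = attn / (attn.max() + 1e-8)            # normalize to [0, 1]

    labels = np.full(attn.shape, -1, dtype=int)  # -1 = ambiguous, ignored in the loss
    labels[attn >= tau_fg] = 1                   # high-confidence foreground (FG)
    labels[attn <= tau_bg] = 0                   # high-confidence background (BG)
    return labels
```
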
read the original abstract

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Q-Zoom, a query-aware adaptive perception framework for efficient MLLMs. It employs a lightweight Dynamic Gating Network to bypass high-resolution processing when coarse features suffice, and a Self-Distilled Region Proposal Network (SD-RPN) to localize task-relevant RoIs from intermediate features using self-supervised distillation. A continuous spatio-temporal alignment fuses the local RoI with global layout. On Qwen2.5-VL-7B, it reports 2.52× inference speedup on Document & OCR benchmarks and 4.39× in high-resolution scenarios while matching baseline peak accuracy; a maximum-fidelity configuration exceeds baseline by 1.1% and 8.1% respectively. Gains transfer to Qwen3-VL, LLaVA, and RL-based models.

Significance. If the empirical claims hold under rigorous validation, Q-Zoom would meaningfully advance efficient high-resolution perception in MLLMs by exploiting query intent and spatial sparsity rather than uniform token scaling. The fully self-supervised training of both the gating network (via consistency-aware labels) and SD-RPN (via distillation) avoids extra supervision costs, which is a practical strength. Transferability across model families further supports potential impact on deployment of vision-language systems.

major comments (3)
  1. [Methods (SD-RPN description)] Methods section on SD-RPN: the claim that the self-distilled proposals 'precisely localize' task-relevant RoIs from coarse intermediate features lacks supporting localization metrics (e.g., IoU or recall on ground-truth regions) or failure-case analysis on fine-grained OCR/Document benchmarks. If proposals systematically miss small text or objects, the subsequent fusion cannot recover detail, directly undermining the 'match or surpass peak accuracy' result.
  2. [Methods (Dynamic Gating Network)] Dynamic Gating Network subsection: the consistency-aware generation of deterministic routing labels assumes coarse global features already encode reliable high-res vs. low-res signals. No ablation or accuracy breakdown is provided for gating decisions (e.g., false-negative rate when high-res is needed), so it is unclear whether the reported 2.52×/4.39× speedups trade off hidden accuracy on edge cases.
  3. [Experiments] Experiments and results: the headline speedups and accuracy numbers are presented without full ablation tables isolating the contribution of gating vs. SD-RPN vs. fusion, nor details on benchmark splits, hyperparameter selection, or statistical significance. This makes it difficult to rule out selection bias or post-hoc tuning as noted in the soundness assessment.
minor comments (2)
  1. [Abstract] The abstract states that Q-Zoom 'establishes a dominant Pareto frontier' but no corresponding figure or table is referenced; a Pareto plot comparing throughput vs. accuracy against baselines would strengthen the claim.
  2. [Methods (fusion)] Notation for the continuous spatio-temporal alignment scheme is introduced without an equation or diagram; a concise formulation would clarify how local RoI features are merged with global layout.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We will make the indicated revisions to strengthen the paper's rigor, clarity, and empirical support.

read point-by-point responses
  1. Referee: Methods section on SD-RPN: the claim that the self-distilled proposals 'precisely localize' task-relevant RoIs from coarse intermediate features lacks supporting localization metrics (e.g., IoU or recall on ground-truth regions) or failure-case analysis on fine-grained OCR/Document benchmarks. If proposals systematically miss small text or objects, the subsequent fusion cannot recover detail, directly undermining the 'match or surpass peak accuracy' result.

    Authors: We acknowledge that the current manuscript lacks explicit quantitative localization metrics such as IoU or recall for the SD-RPN outputs. This stems from the fully self-supervised distillation training paradigm, which does not use ground-truth region annotations. To address the concern, we will add qualitative visualizations of RoI proposals, failure-case analysis on OCR and document benchmarks, and proxy evaluations (e.g., downstream accuracy sensitivity to proposal quality thresholds). We will also revise the Methods section to better contextualize the 'precise localization' claim as being validated by end-to-end task performance rather than direct spatial metrics. These changes will be included in the revised version. revision: yes

  2. Referee: Dynamic Gating Network subsection: the consistency-aware generation of deterministic routing labels assumes coarse global features already encode reliable high-res vs. low-res signals. No ablation or accuracy breakdown is provided for gating decisions (e.g., false-negative rate when high-res is needed), so it is unclear whether the reported 2.52×/4.39× speedups trade off hidden accuracy on edge cases.

    Authors: We agree that additional analysis of the gating decisions would improve transparency. In the revised manuscript, we will include a dedicated ablation that reports accuracy breakdowns for gated vs. forced high-resolution paths, along with the false-negative rate (instances where high-resolution processing is required but the gate selects the low-resolution path). This will demonstrate the reliability of the consistency-aware label generation and confirm that the reported speedups do not mask accuracy losses on edge cases. Examples of gating behavior on challenging inputs will also be added. revision: yes

  3. Referee: Experiments and results: the headline speedups and accuracy numbers are presented without full ablation tables isolating the contribution of gating vs. SD-RPN vs. fusion, nor details on benchmark splits, hyperparameter selection, or statistical significance. This makes it difficult to rule out selection bias or post-hoc tuning as noted in the soundness assessment.

    Authors: We will expand the Experiments section with full ablation tables that isolate the individual and combined contributions of the Dynamic Gating Network, SD-RPN, and spatio-temporal fusion components. We will also add details on the benchmark dataset splits, the hyperparameter selection procedure (including validation strategies), and statistical significance measures such as means and standard deviations across multiple runs. These additions will help address potential concerns about selection bias or post-hoc tuning. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims are empirical results on external benchmarks

full rationale

The paper proposes Q-Zoom with a Dynamic Gating Network and Self-Distilled Region Proposal Network trained via consistency-aware labeling and self-supervised distillation, then fuses their outputs and reports speedups (2.52× / 4.39×) plus accuracy matching or exceeding baselines on Document & OCR and high-resolution benchmarks. These metrics are obtained from direct experimental evaluation on held-out test sets using Qwen2.5-VL-7B and other models; none of the reported throughput or accuracy figures follows by construction from the paper's own equations, fitted parameters, or self-citations. The derivation chain consists of standard training and inference procedures whose outputs are independently measurable against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described in sufficient detail to list; the new modules are introduced conceptually without quantified assumptions or external validation handles.

pith-pipeline@v0.9.0 · 5631 in / 1280 out tokens · 50834 ms · 2026-05-10T18:42:38.856391+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
