Recognition: 2 Lean theorem links
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Pith reviewed 2026-05-10 18:42 UTC · model grok-4.3
The pith
Q-Zoom lets multimodal LLMs process only query-relevant high-resolution image regions to speed up inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Q-Zoom is a query-aware adaptive high-resolution perception framework for MLLMs that operates in an efficient coarse-to-fine manner. A lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. For queries that need fine-grained perception, a Self-Distilled Region Proposal Network precisely localizes the task-relevant Region-of-Interest directly from intermediate feature spaces. These modules are optimized with a consistency-aware generation strategy for routing labels and a fully self-supervised distillation paradigm for the proposer. A continuous spatio-temporal alignment scheme then fuses the dense local RoI with the coarse global layout.
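Read as a control flow, the core claim implies a three-step inference loop: gate, propose, fuse. The sketch below is a minimal reconstruction of that loop from this description only, not the released implementation; every interface name (encode_coarse, gate, propose_roi, encode_roi, fuse, answer) is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)

@dataclass
class QZoomSketch:
    """Minimal coarse-to-fine loop implied by the core claim (hypothetical interfaces)."""
    encode_coarse: Callable[[object], object]           # low-resolution vision encoder
    encode_roi: Callable[[object, Box], object]          # high-resolution encoder on a cropped RoI
    gate: Callable[[object, str], bool]                  # Dynamic Gating Network: True -> coarse suffices
    propose_roi: Callable[[object, str], Box]            # SD-RPN: query-conditioned RoI from intermediate features
    fuse: Callable[[object, object, Box], object]        # align dense RoI tokens with the coarse global layout
    answer: Callable[[object, str], str]                 # LLM decoding over the (fused) visual tokens

    def run(self, image: object, query: str) -> str:
        global_feats = self.encode_coarse(image)
        if self.gate(global_feats, query):
            # Coarse global features suffice: skip high-resolution processing entirely.
            return self.answer(global_feats, query)
        box = self.propose_roi(global_feats, query)       # localize the query-relevant region
        roi_feats = self.encode_roi(image, box)           # dense tokens only for that region
        fused = self.fuse(global_feats, roi_feats, box)   # merge local detail with global context
        return self.answer(fused, query)
```

The key property is that the expensive high-resolution encoder is touched only when the gate declines to answer from coarse features, and then only on the proposed crop.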
What carries the argument
The Dynamic Gating Network for deciding high-resolution use plus the Self-Distilled Region Proposal Network for localizing query-relevant RoIs, trained without external labels and fused via spatio-temporal alignment.
If this is right
- Delivers 2.52 times faster inference on Document and OCR benchmarks while matching the baseline peak accuracy.
- Achieves 4.39 times speedup in high-resolution scenarios with matching accuracy.
- Can exceed baseline accuracy by 1.1 percent on document tasks and 8.1 percent on high-resolution tasks when configured for maximum fidelity.
- The same speed and accuracy benefits transfer directly to Qwen3-VL, LLaVA, and RL-based thinking-with-image models without architecture changes.
Where Pith is reading between the lines
- The same query-driven gating idea could be tested on video or 3D inputs where only certain frames or viewpoints need high detail.
- Self-supervised distillation of the region proposer may allow similar adaptive modules to be added to existing models with minimal new labeled data.
- Combining Q-Zoom style routing with token compression techniques might further reduce memory use for very large images or long contexts.
Load-bearing premise
The gating and proposal networks can correctly decide when high-resolution is needed and accurately locate the relevant image region without adding meaningful overhead or localization mistakes that would hurt final performance.
What would settle it
A controlled test on a fine-grained document query where the gating network routes to low-resolution only, producing a clear accuracy drop relative to the always-high-resolution baseline on the same model.
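A minimal harness for this falsification test could look like the sketch below. It assumes a hypothetical model_generate(image, query, force_low_res=...) hook and a benchmark-specific is_correct scorer; neither is part of any released Q-Zoom API.

```python
from typing import Callable, Iterable, Tuple

def routing_falsification_test(
    model_generate: Callable[..., str],          # hypothetical: (image, query, force_low_res: bool) -> answer
    is_correct: Callable[[str, str], bool],      # scorer matching the benchmark's answer format
    samples: Iterable[Tuple[object, str, str]],  # (image, fine-grained query, gold answer)
) -> Tuple[float, float]:
    """Compare forced low-resolution routing against the always-high-resolution baseline."""
    low_hits = high_hits = total = 0
    for image, query, gold in samples:
        low = model_generate(image, query, force_low_res=True)    # gate forced to the coarse path
        high = model_generate(image, query, force_low_res=False)  # always process high resolution
        low_hits += is_correct(low, gold)
        high_hits += is_correct(high, gold)
        total += 1
    return low_hits / total, high_hits / total  # a clear gap would confirm the gate is load-bearing
```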
Original abstract
MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Q-Zoom, a query-aware adaptive perception framework for efficient MLLMs. It employs a lightweight Dynamic Gating Network to bypass high-resolution processing when coarse features suffice, and a Self-Distilled Region Proposal Network (SD-RPN) to localize task-relevant RoIs from intermediate features using self-supervised distillation. A continuous spatio-temporal alignment fuses the local RoI with global layout. On Qwen2.5-VL-7B, it reports 2.52× inference speedup on Document & OCR benchmarks and 4.39× in high-resolution scenarios while matching baseline peak accuracy; a maximum-fidelity configuration exceeds baseline by 1.1% and 8.1% respectively. Gains transfer to Qwen3-VL, LLaVA, and RL-based models.
Significance. If the empirical claims hold under rigorous validation, Q-Zoom would meaningfully advance efficient high-resolution perception in MLLMs by exploiting query intent and spatial sparsity rather than uniform token scaling. The fully self-supervised training of both the gating network (via consistency-aware labels) and SD-RPN (via distillation) avoids extra supervision costs, which is a practical strength. Transferability across model families further supports potential impact on deployment of vision-language systems.
Major comments (3)
- [Methods (SD-RPN description)] Methods section on SD-RPN: the claim that the self-distilled proposals 'precisely localize' task-relevant RoIs from coarse intermediate features lacks supporting localization metrics (e.g., IoU or recall on ground-truth regions) or failure-case analysis on fine-grained OCR/Document benchmarks. If proposals systematically miss small text or objects, the subsequent fusion cannot recover detail, directly undermining the 'match or surpass peak accuracy' result. A minimal sketch of the requested localization metrics follows this list.
- [Methods (Dynamic Gating Network)] Dynamic Gating Network subsection: the consistency-aware generation of deterministic routing labels assumes coarse global features already encode reliable high-res vs. low-res signals. No ablation or accuracy breakdown is provided for gating decisions (e.g., false-negative rate when high-res is needed), so it is unclear whether the reported 2.52×/4.39× speedups trade off hidden accuracy on edge cases.
- [Experiments] Experiments and results: the headline speedups and accuracy numbers are presented without full ablation tables isolating the contribution of gating vs. SD-RPN vs. fusion, nor details on benchmark splits, hyperparameter selection, or statistical significance. This makes it difficult to rule out selection bias or post-hoc tuning as noted in the soundness assessment.
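To make the first major comment concrete, the fragment below computes mean IoU and recall at a threshold for predicted RoI boxes against annotated regions. It illustrates the metric being requested and is not code from the paper; boxes are assumed to be normalized (x0, y0, x1, y1) tuples.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_report(pred: List[Box], gt: List[Box], thr: float = 0.5) -> Tuple[float, float]:
    """Mean IoU and recall@thr over paired (predicted RoI, ground-truth region) examples."""
    scores = [iou(p, g) for p, g in zip(pred, gt)]
    mean_iou = sum(scores) / len(scores)
    recall = sum(s >= thr for s in scores) / len(scores)
    return mean_iou, recall
```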
Minor comments (2)
- [Abstract] The abstract states that Q-Zoom 'establishes a dominant Pareto frontier' but no corresponding figure or table is referenced; a Pareto plot comparing throughput vs. accuracy against baselines would strengthen the claim (a plotting sketch follows this list).
- [Methods (fusion)] Notation for the continuous spatio-temporal alignment scheme is introduced without an equation or diagram; a concise formulation would clarify how local RoI features are merged with the global layout (an illustrative formulation follows this list).
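The requested Pareto figure is straightforward once per-configuration (throughput, accuracy) pairs are logged. The matplotlib sketch below uses placeholder numbers purely to show the intended plot; the actual figure would use the paper's measured values.

```python
import matplotlib.pyplot as plt

# Placeholder (relative throughput, accuracy) points -- NOT measured values from the paper.
configs = {
    "baseline (always high-res)": (1.0, 80.0),
    "Q-Zoom (speed setting)":     (2.5, 80.0),
    "Q-Zoom (fidelity setting)":  (1.3, 81.1),
}

fig, ax = plt.subplots(figsize=(4, 3))
for name, (speedup, acc) in configs.items():
    ax.scatter(speedup, acc, label=name)  # one point per configuration
ax.set_xlabel("relative throughput (x baseline)")
ax.set_ylabel("benchmark accuracy (%)")
ax.legend(fontsize=7)
fig.tight_layout()
fig.savefig("pareto_throughput_vs_accuracy.png", dpi=200)
```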
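As an illustration of the kind of formulation the second minor comment asks for, and not the paper's actual scheme: one generic way to align a dense RoI with the coarse global layout is to assign each RoI token a fractional position inside the global token grid, so both token sets share a single positional frame.

```latex
% Illustrative only: not the paper's actual alignment scheme.
% The RoI box (x_0, y_0, x_1, y_1) is expressed in global-grid coordinates,
% and the high-resolution crop is tokenized into an h x w patch grid.
\[
  p_{ij} = \left( x_0 + \frac{(j + \tfrac{1}{2})(x_1 - x_0)}{w},\;
                  y_0 + \frac{(i + \tfrac{1}{2})(y_1 - y_0)}{h} \right),
  \qquad 0 \le i < h,\ 0 \le j < w .
\]
\[
  Z = \left[\, E_{\text{global}} + \mathrm{PE}\!\left(p^{\text{global}}\right);\;
               E_{\text{RoI}} + \mathrm{PE}\!\left(p_{ij}\right) \right]
\]
% The LLM then attends over a single sequence whose positional encodings place
% the dense RoI tokens at continuous (fractional) coordinates of the global layout.
```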
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We will make the indicated revisions to strengthen the paper's rigor, clarity, and empirical support.
Point-by-point responses
-
Referee: Methods section on SD-RPN: the claim that the self-distilled proposals 'precisely localize' task-relevant RoIs from coarse intermediate features lacks supporting localization metrics (e.g., IoU or recall on ground-truth regions) or failure-case analysis on fine-grained OCR/Document benchmarks. If proposals systematically miss small text or objects, the subsequent fusion cannot recover detail, directly undermining the 'match or surpass peak accuracy' result.
Authors: We acknowledge that the current manuscript lacks explicit quantitative localization metrics such as IoU or recall for the SD-RPN outputs. This stems from the fully self-supervised distillation training paradigm, which does not use ground-truth region annotations. To address the concern, we will add qualitative visualizations of RoI proposals, failure-case analysis on OCR and document benchmarks, and proxy evaluations (e.g., downstream accuracy sensitivity to proposal quality thresholds). We will also revise the Methods section to better contextualize the 'precise localization' claim as being validated by end-to-end task performance rather than direct spatial metrics. These changes will be included in the revised version. revision: yes
-
Referee: Dynamic Gating Network subsection: the consistency-aware generation of deterministic routing labels assumes coarse global features already encode reliable high-res vs. low-res signals. No ablation or accuracy breakdown is provided for gating decisions (e.g., false-negative rate when high-res is needed), so it is unclear whether the reported 2.52×/4.39× speedups trade off hidden accuracy on edge cases.
Authors: We agree that additional analysis of the gating decisions would improve transparency. In the revised manuscript, we will include a dedicated ablation that reports accuracy breakdowns for gated vs. forced high-resolution paths, along with the false-negative rate (instances where high-resolution processing is required but the gate selects the low-resolution path). This will demonstrate the reliability of the consistency-aware label generation and confirm that the reported speedups do not mask accuracy losses on edge cases. Examples of gating behavior on challenging inputs will also be added. revision: yes
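The false-negative rate promised here has a simple operational definition: among queries that demonstrably need the high-resolution path, how often does the gate still choose the coarse path? The snippet below assumes the two per-example booleans are already logged; it illustrates the metric rather than the authors' evaluation code.

```python
from typing import List

def gating_false_negative_rate(
    gate_chose_low_res: List[bool],   # gate routed this example to the coarse path
    needs_high_res: List[bool],       # e.g., forced high-res answer correct, forced low-res answer wrong
) -> float:
    """Fraction of examples that require high resolution but were routed to the coarse path."""
    routed_low_when_needed = [low for low, needed in zip(gate_chose_low_res, needs_high_res) if needed]
    if not routed_low_when_needed:
        return 0.0
    return sum(routed_low_when_needed) / len(routed_low_when_needed)
```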
-
Referee: Experiments and results: the headline speedups and accuracy numbers are presented without full ablation tables isolating the contribution of gating vs. SD-RPN vs. fusion, nor details on benchmark splits, hyperparameter selection, or statistical significance. This makes it difficult to rule out selection bias or post-hoc tuning as noted in the soundness assessment.
Authors: We will expand the Experiments section with full ablation tables that isolate the individual and combined contributions of the Dynamic Gating Network, SD-RPN, and spatio-temporal fusion components. We will also add details on the benchmark dataset splits, the hyperparameter selection procedure (including validation strategies), and statistical significance measures such as means and standard deviations across multiple runs. These additions will help address potential concerns about selection bias or post-hoc tuning. revision: yes
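For the promised significance reporting, a mean and standard deviation over seeds plus a paired sign-flip (randomization) test over per-example correctness is usually sufficient; the snippet below is a generic recipe, not tied to the paper's evaluation harness.

```python
import random
from statistics import mean, stdev
from typing import List, Sequence

def seed_summary(acc_per_seed: Sequence[float]) -> str:
    """Report a benchmark score as mean +/- standard deviation over independent seeds."""
    return f"{mean(acc_per_seed):.2f} +/- {stdev(acc_per_seed):.2f}"

def paired_signflip_pvalue(a: List[bool], b: List[bool], n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired randomization test on per-example correctness of systems A and B."""
    rng = random.Random(seed)
    diffs = [int(x) - int(y) for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)  # swap A/B per example at random
        extreme += abs(flipped) >= observed
    return (extreme + 1) / (n_perm + 1)
```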
Circularity Check
No circularity: performance claims are empirical results on external benchmarks
Full rationale
The paper proposes Q-Zoom with a Dynamic Gating Network and Self-Distilled Region Proposal Network trained via consistency-aware labeling and self-supervised distillation, then fuses outputs and reports speedups (2.52× / 4.39×) plus accuracy matching or exceeding baselines on Document & OCR and high-resolution benchmarks. These metrics are obtained from direct experimental evaluation on held-out test sets using Qwen2.5-VL-7B and other models; no equations, fitted parameters, or self-citations reduce the reported throughput or accuracy figures to the inputs by construction. The derivation chain consists of standard training and inference procedures whose outputs are independently measurable against external data.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
lightweight Dynamic Gating Network ... Self-Distilled Region Proposal Network (SD-RPN) ... consistency-aware generation strategy ... self-supervised distillation paradigm ... continuous spatio-temporal positional encoding
-
IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks ... 4.39 times in High-Resolution scenarios
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.