pith. sign in

arxiv: 2605.18603 · v1 · pith:JS7DRHU4new · submitted 2026-05-18 · 💻 cs.CV

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Pith reviewed 2026-05-20 10:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language modelsactive perceptionlazy perceptionvisual bandwidthperceptual starvationsituated agentszoom crop pan operations
0
0 comments X

The pith

Constraining each visual observation to a tight token budget forces VLMs to learn functional active perception rather than lazy mimicry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models deployed as agents tend to mimic zoom, crop, and pan operations without actually depending on their results, because coarse global views plus language priors often suffice for moderate accuracy. The paper traces this lazy perception to a learning asymmetry where models have no incentive to perform harder multi-step visual search. Starve to Perceive removes that shortcut by restricting each observation to a small token budget so that no single view can complete the task. This minimal plug-in change to standard training produces roughly 5 percent average relative gains across benchmarks without any auxiliary losses, reward shaping, or architecture modifications.

Core claim

When visual input per observation is limited to a tight token budget, training makes active perception the only viable path, so models learn to issue and depend on zoom, crop, and pan operations instead of ignoring their outputs.

What carries the argument

Perceptual starvation via constrained visual bandwidth that limits tokens per observation and thereby requires multi-step visual search.

If this is right

  • Active perception becomes functionally necessary during training rather than optional.
  • The same gains appear without adding losses, rewards, or new model components.
  • The method works as a drop-in change to existing post-training pipelines.
  • Improvements hold across diverse benchmarks for high-resolution situated agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Bandwidth limits may similarly encourage active exploration in other multimodal agent settings such as navigation or robotics.
  • The result suggests that many current perception shortfalls in VLMs arise from training incentives rather than inherent model limits.
  • Testing the approach at still higher resolutions could reveal whether the token starvation scales or requires further adjustments.

Load-bearing premise

Limiting each observation to a tight token budget will eliminate viable shortcuts and force the model to learn useful zoom, crop, and pan operations rather than failing or inventing other workarounds.

What would settle it

Measure whether performance stays high on the same tasks when the trained model is forced to ignore or disable its zoom, crop, and pan operations.

Figures

Figures reproduced from arXiv: 2605.18603 by Cong Wei, Fangzhen Lin, Haozhe Wang, Wenhu Chen, Yuhuan Wu.

Figure 1
Figure 1. Figure 1: Overview of Starve to Perceive. (a) A Visual Bandwidth (parametrized by B) limits the upper bound of both the global image and cropped regions (b) Two-stage training: Budget-Aware Visual Instruction Tuning initializes exploration under token constraints; Reinforcement Learning with Perceptual Starvation train the model via self-collected trajectories under visual constrain to learn active perception. Contr… view at source ↗
Figure 2
Figure 2. Figure 2: Training Dynamics and Final Performance of Budget Ablation Across Training Stages. "All Direct Ratio" measures the proportion of queries where the model con￾sistently bypasses visual grounding and directly answers across all sampled rollouts at a given policy checkpoint, while "All Focus Ratio" measures the proportion of queries for which the model consistently chooses to select regions across all sampled … view at source ↗
Figure 3
Figure 3. Figure 3: RL training cost compar￾ison. Budget-Aware SFT (ZeroShot + BudgetRL), which collapses toward direct an￾swering during RL, achieves the lowest performance across all high-resolution visual search benchmarks. The model trained without the visual constraint dur￾ing RL (BA-SFT + VanillaRL), which exhibits weaker active-perception pres￾sure despite a healthy SFT initialization, yields intermediate scores. Our f… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative example of active perception. Our budget-aware model focuses on informative regions and grounds its answer in the returned local evidence, while the non-budgeted baseline exhibits a lazy-perception failure [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
read the original abstract

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve substantial gains of 5% average relative improvement across diverse benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes 'Starve to Perceive,' a minimal post-training modification for Vision-Language Models that restricts each visual observation to a tight token budget. This constraint is intended to eliminate 'lazy perception'—where models mimic zoom/crop/pan operations without functionally depending on their outputs—by making active perception the only viable path to task success. The central empirical claim is an average 5% relative improvement across diverse benchmarks, achieved without auxiliary losses, reward shaping, or architectural changes.

Significance. If the gains are shown to arise specifically from functional active perception rather than side-effects of the constraint, the method would provide a simple, plug-in intervention for improving VLM agents in high-resolution settings. The approach directly targets a documented learning asymmetry and requires no extra machinery, which is a practical strength for adoption in existing pipelines.

major comments (2)
  1. [Experimental Evaluation] Experimental Evaluation: The manuscript reports a 5% average relative improvement but provides no details on the exact benchmarks, baseline comparisons, statistical significance, variance across runs, or ablation controls. This prevents evaluation of whether the gains support the claim that perceptual starvation forces active perception.
  2. [Training Paradigm and Ablation Analysis] Training Paradigm and Ablation Analysis: No post-training ablation is reported in which the learned zoom/crop/pan operations are disabled or replaced by fixed/random views. Without this test, it remains possible that improvements arise from implicit regularization, altered gradient flow, or forced multi-turn reasoning rather than functional dependence on active perception outputs, undermining the central mechanistic claim.
minor comments (2)
  1. [Abstract] The abstract refers to 'diverse benchmarks' without naming them; listing the specific tasks and datasets in the abstract or introduction would improve immediate readability.
  2. [Method] Notation for the token budget constraint and observation limit should be introduced with a clear equation or pseudocode early in the method section to avoid ambiguity when describing the starvation mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments have prompted us to strengthen the experimental section with additional details and controls. We respond to each major comment below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental Evaluation: The manuscript reports a 5% average relative improvement but provides no details on the exact benchmarks, baseline comparisons, statistical significance, variance across runs, or ablation controls. This prevents evaluation of whether the gains support the claim that perceptual starvation forces active perception.

    Authors: We appreciate the referee noting the need for greater transparency. The original manuscript presents the 5% relative gain in Section 4 across a suite of VQA, reasoning, and navigation benchmarks, with comparisons to standard fine-tuning baselines. To address the concern directly, the revised version adds explicit listings of all datasets, a new table with full baseline results, standard deviations computed over three independent runs, and paired t-test p-values confirming statistical significance (p < 0.05) for the reported improvements. Expanded ablation tables on token-budget sizes are also included in Section 5. revision: yes

  2. Referee: [Training Paradigm and Ablation Analysis] Training Paradigm and Ablation Analysis: No post-training ablation is reported in which the learned zoom/crop/pan operations are disabled or replaced by fixed/random views. Without this test, it remains possible that improvements arise from implicit regularization, altered gradient flow, or forced multi-turn reasoning rather than functional dependence on active perception outputs, undermining the central mechanistic claim.

    Authors: This is a fair and important point for isolating the mechanism. In the revised manuscript we add a post-training ablation that freezes the active-perception policy and substitutes fixed random views at inference time. Performance falls back to levels statistically indistinguishable from the unconstrained baseline, while attention maps show markedly lower utilization of the provided visual tokens. These results support that the gains arise from functional dependence on the learned operations rather than regularization or multi-turn effects alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training modification with no derivations

full rationale

The paper presents Starve to Perceive as a practical training change that restricts visual token budget per observation to force active perception strategies in VLMs. The reported outcome is an empirical 5% average relative improvement across benchmarks, achieved without auxiliary losses or architectural modifications. No equations, first-principles derivations, or predictions are offered that could reduce the gains to fitted parameters, self-defined quantities, or self-citation chains by construction. The motivation (that tight bandwidth makes active perception the only viable path) is a design rationale, not a mathematical claim that loops back on itself. The work is therefore self-contained as an experimental result rather than a derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that token-budget restriction will induce active perception rather than training collapse or alternative shortcuts. No free parameters, axioms, or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Coarse global views plus language priors are sufficient for moderate accuracy on the target tasks, removing any incentive for multi-step visual search.
    Stated in the abstract as the root cause of lazy perception.

pith-pipeline@v0.9.0 · 5725 in / 1267 out tokens · 30590 ms · 2026-05-20T10:48:21.004790+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    By restricting the maximum token count per glimpse, we introduce a strict upper bound on the channel capacity between the original high-resolution image X and the model’s internal state. ... the only viable mathematical solution to maximize the objective is to learn a policy that actively filters out noise

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the constrained environment acts as a strict physical regularizer ... active multi-step visual reasoning ceases to be an optional strategy; it becomes the singular pathway to maximizing the reward

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 20 internal anchors

  1. [1]

    arXiv preprint arXiv:2511.05017 (2025) 13

    Agrawal, A., KV, G., Aralikatti, R., Jagatap, G., Yuan, J., Kamarshi, V., Fanelli, A., Huang, F.: Towards mitigating hallucinations in large vision-language models by refining textual embeddings. arXiv preprint arXiv:2511.05017 (2025) 13

  2. [2]

    Advances in neural information processing systems35, 23716– 23736 (2022) 13

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716– 23736 (2022) 13

  3. [3]

    Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments (2018),https:// arxiv.org/abs/1711.072801

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Bai, J., Bai, S., Chen, K., Du, M., Fan, Y., Fan, Z., Ge, W., Liu, D., Men, R., Ren, X., et al.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023) 13

  5. [5]

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923 7, 8

  6. [6]

    Feng, H., Liu, Q., Liu, H., Tang, J., Zhou, W., Li, H., Huang, C.: Docpedia: Un- leashing the power of large multimodal model in the frequency domain for versatile document understanding (2024),https://arxiv.org/abs/2311.118101

  7. [7]

    Gemini Team, Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., Silver, D., et al.: Gemini: A family of highlycapablemultimodalmodels.arXivpreprintarXiv:2508.11630(2025),https: //arxiv.org/abs/2312.1180513

  8. [8]

    Cur- rent opinion in neurobiology21(4), 553–558 (2011) 1

    Ibbotson, M., Krekelberg, B.: Visual perception and saccadic eye movements. Cur- rent opinion in neurobiology21(4), 553–558 (2011) 1

  9. [9]

    In: International conference on machine learning

    Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021) 13

  10. [10]

    Lai, X., Li, J., Li, W., Liu, T., Li, T., Zhao, H.: Mini-o3: Scaling up reasoning patterns and interaction turns for visual search (2025),https://arxiv.org/abs/ 2509.079697, 8, 19, 20, 21

  11. [11]

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., Li, C.: Llava-onevision: Easy visual task transfer (2024),https: //arxiv.org/abs/2408.033268

  12. [12]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 13

  13. [13]

    arXiv preprint arXiv:2508.09456 (2025) 21

    Li, J., Xu, B., Chen, S., Li, J., Lei, J., Zhao, H., Zhang, D.: Iag: Input-aware backdoor attack on vlm-based visual grounding. arXiv preprint arXiv:2508.09456 (2025) 21

  14. [14]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Li, J., Zhang, D., Wang, X., Hao, Z., Lei, J., Tan, Q., Zhou, C., Liu, W., Yang, Y., Xiong, X., et al.: Chemvlm: Exploring the power of multimodal large language models in chemistry area. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 415–423 (2025) 21

  15. [15]

    Evaluating Object Hallucination in Large Vision-Language Models

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hal- lucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023) 13

  16. [16]

    arXiv preprint arXiv:2508.04567 (2025) 13

    Li, Y., Zhou, K., Zhao, W.X., Fang, L., Wen, J.R.: Analyzing and mitigating object hallucination: A training bias perspective. arXiv preprint arXiv:2508.04567 (2025) 13

  17. [17]

    Advances in Neural Information Processing Systems , year =

    Liu, C., Xu, Z., Wei, Q., Wu, J., Zou, J., Wang, X.E., Zhou, Y., Liu, S.: More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523 (2025) 13

  18. [18]

    Advances in neural information processing systems36, 34892–34916 (2023) 13

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 13

  19. [19]

    Masry, A., Long, D.X., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning (2022),https: //arxiv.org/abs/2203.102441

  20. [20]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 13

  21. [21]

    nature323(6088), 533–536 (1986) 2

    Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back- propagating errors. nature323(6088), 533–536 (1986) 2

  22. [22]

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models (2024),https://arxiv.org/abs/2402.03300 7

  23. [23]

    In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V

    Shen, H., Zhao, K., Zhao, T., Xu, R., Zhang, Z., Zhu, M., Yin, J.: ZoomEye: Enhancing multimodal LLMs with human-like zooming capabilities through tree- based image exploration. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Proceedings of the 2025 Conference on Empirical Methods in Nat- ural Language Processing. pp. 6602–6618. Ass...

  24. [24]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256 (2024) 18

  25. [25]

    Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle (2015),https://arxiv.org/abs/1503.024064

  26. [26]

    Wang, B., Li, G., Zhou, X., Chen, Z., Grossman, T., Li, Y.: Screen2words: Automatic mobile ui summarization with multimodal learning (2021),https: //arxiv.org/abs/2108.033531

  27. [27]

    Wang, C., Wang, H., Chen, X., Liu, J., Xue, T., Peng, C., Qi, D., Lin, F., Yan, Y.: From illusion to intention: Visual rationale learning for vision-language reasoning (2025),https://arxiv.org/abs/2511.230312, 13, 21

  28. [28]

    Wang, H., Li, X., Huang, Z., Wang, A., Wang, J., Zhang, T., Zheng, J., Bai, S., Kang, Z., Feng, J., Wang, Z., Zhang, Z.: Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology (2025),https://arxiv.org/ abs/2507.079997, 8, 19, 20, 21

  29. [29]

    In:FindingsoftheAssociationforComputationalLinguistics:ACL2025.pp.3060– 3075 (2025) 21

    Wang, H., Li, L., Qu, C., Xu, W., Zhu, F., Chu, W., Lin, F.: To code or not to code? adaptive tool integration for math language models via expectation-maximization. In:FindingsoftheAssociationforComputationalLinguistics:ACL2025.pp.3060– 3075 (2025) 21

  30. [30]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2025), spotlight 21

    Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incen- tivizing self-reflection of vision-language models with reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2025), spotlight 21

  31. [31]

    In: International Conference on Learning Representations (ICLR) (2026) 21

    Wang, H., Que, H., Xu, Q., Liu, M., Zhou, W., Feng, J., Zhong, W., Ye, W., Yang, T., Huang, W., et al.: Reverse-engineered reasoning for open-ended generation. In: International Conference on Learning Representations (ICLR) (2026) 21

  32. [32]

    Wang, H., Su, A., Ren, W., Lin, F., Chen, W.: Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning (2025),https: //arxiv.org/abs/2505.159662, 7, 8, 13, 20

  33. [33]

    RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

    Wang, H., Wei, C., Ren, W., Liu, J., Lin, F., Chen, W.: Rationalrewards: Rea- soning rewards scale visual generation both training and test time. arXiv preprint arXiv:2604.11626 (2026) 21

  34. [34]

    In: International Conference on Learning Representations (ICLR) (2026) 21

    Wang, H., Xu, Q., Liu, C., Wu, J., Lin, F., Chen, W.: Emergent hierarchical rea- soning in llms through reinforcement learning. In: International Conference on Learning Representations (ICLR) (2026) 21

  35. [35]

    In: International Conference on Machine Learning (ICML) (2026), spotlight 21

    Wang, H., Xu, Q., Wang, C., Xue, T., Peng, C., Chen, W., Lin, F.: Bad seeing or bad thinking? rewarding perception for vision-language reasoning. In: International Conference on Machine Learning (ICML) (2026), spotlight 21

  36. [36]

    Wang,W.,Ding,L.,Zeng,M.,Zhou,X.,Shen,L.,Luo,Y.,Tao,D.:Divide,conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models (2024),https://arxiv.org/abs/2408.155568

  37. [37]

    Wang, X., Huang, J., Abdalla, R., Zhang, C., Xian, R., Manocha, D.: Bi-vlm: Push- ing ultra-low precision post-training quantization boundaries in vision-language models (2025),https://arxiv.org/abs/2509.187632

  38. [38]

    Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms (2023),https://arxiv.org/abs/2312.141358, 13

  39. [39]

    xAI: Grok-1.5 vision preview.https://x.ai/news/grok- 1.5v(Apr 2024), ac- cessed: 2024-08-27 8

  40. [40]

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.Y., Zhang, Y.Q., Yan, L., Qiao, M., Wu, Y., Wang, M.: Dapo: An open-source l...

  41. [41]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Zhang, D., Lei, J., Li, J., Wang, X., Liu, Y., Yang, Z., Li, J., Wang, W., Yang, S., Wu, J., et al.: Critic-v: Vlm critics help catch vlm errors in multimodal reasoning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9050–9061 (2025) 21

  42. [42]

    Zhang, X., Gao, Z., Zhang, B., Li, P., Zhang, X., Liu, Y., Yuan, T., Wu, Y., Jia, Y., Zhu, S.C., Li, Q.: Adaptive chain-of-focus reasoning via dynamic visual search and zooming for efficient vlms (2025),https://arxiv.org/abs/2505.154368, 13

  43. [43]

    Zhang, Y.F., Zhang, H., Tian, H., Fu, C., Zhang, S., Wu, J., Li, F., Wang, K., Wen, Q., Zhang, Z., Wang, L., Jin, R., Tan, T.: Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? (2025),https://arxiv.org/abs/2408.132578

  44. [44]

    Evaluating and steering modality preferences in multimodal large language model

    Zhang, Y., Ma, J., Hou, Y., Bai, X., Chen, K., Xiang, Y., Yu, J., Zhang, M.: Evaluating and steering modality preferences in multimodal large language model, 2025a. URL https://arxiv. org/abs/2505.20977 21

  45. [45]

    Instruction Anchor: Dissecting the Mechanistic Dynamics of Modality Arbitration

    Zhang, Y., Xu, M., Bai, X., Zhang, P., Xiang, Y., Zhang, M., et al.: Instruction anchors: Dissecting the causal dynamics of modality arbitration. arXiv preprint arXiv:2602.03677 (2026) 21

  46. [46]

    When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms,

    Zhang, Z., Wang, T., Gong, X., Shi, Y., Wang, H., Wang, D., Hu, L.: When modal- ities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms. arXiv preprint arXiv:2511.02243 (2025) 13

  47. [47]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: Llamafac- tory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thailand (2024),http://arxiv.org/abs/24...

  48. [48]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning (2025), https://arxiv.org/abs/2505.143627, 8, 13, 21

  49. [49]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 13 Supplementary Material A Limitations WhileStarve to Perceiveachieves strong gains with minimal modifications to existin...

  50. [50]

    The agent thinks about needing to examine the region closely

  51. [51]

    The agent calls focus on the specific region(s) from the input

  52. [52]

    Wait, this box seems wrong

    If a box is marked "Wait, this box seems wrong.", the agent 13self-corrects by pivoting to other regions. 14 15Strict Rules: 16- PRESERVE all bounding box coordinates exactly. 17- ALL <box>...</box> bboxes must appear in <tool_call>...</tool_call>. 18- Each turn: <think>...</think> <tool_call>...</tool_call> 19OR: <think>...</think> <answer>...</answer> 2...