Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Cong Wei; Fangzhen Lin; Haozhe Wang; Wenhu Chen; Yuhuan Wu

arxiv: 2605.18603 · v1 · pith:JS7DRHU4new · submitted 2026-05-18 · 💻 cs.CV

Pith reviewed 2026-05-20 10:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords: vision-language models, active perception, lazy perception, visual bandwidth, perceptual starvation, situated agents, zoom crop pan operations

The pith

Constraining each visual observation to a tight token budget forces VLMs to learn functional active perception rather than lazy mimicry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models deployed as agents tend to mimic zoom, crop, and pan operations without actually depending on their results, because coarse global views plus language priors often suffice for moderate accuracy. The paper traces this lazy perception to a learning asymmetry where models have no incentive to perform harder multi-step visual search. Starve to Perceive removes that shortcut by restricting each observation to a small token budget so that no single view can complete the task. This minimal plug-in change to standard training produces roughly 5 percent average relative gains across benchmarks without any auxiliary losses, reward shaping, or architecture modifications.

Core claim

When visual input per observation is limited to a tight token budget, training makes active perception the only viable path, so models learn to issue and depend on zoom, crop, and pan operations instead of ignoring their outputs.

What carries the argument

Perceptual starvation via constrained visual bandwidth that limits tokens per observation and thereby requires multi-step visual search.
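
One way to picture the bandwidth constraint is as a resize rule applied to every view, global or cropped, before it reaches the vision encoder. The sketch below is an illustration under stated assumptions, not the paper's implementation: the function name `budgeted_size`, the default of 28 pixels per token side, and the shrink-to-fit rule are all hypothetical choices made here for clarity.

```python
import math


def budgeted_size(width, height, budget, patch=28):
    """Return (new_w, new_h) whose patch-token count does not exceed `budget`.

    `patch` is the assumed pixel side covered by one visual token; 28 is a
    placeholder for illustration, not a value taken from the paper.
    """
    if budget < 1:
        raise ValueError("budget must be at least one token")

    def tokens(w, h):
        return math.ceil(w / patch) * math.ceil(h / patch)

    if tokens(width, height) <= budget:
        return width, height  # view already fits, keep native resolution

    # Uniform scale that brings the pixel area down to about budget * patch^2.
    scale = math.sqrt(budget * patch * patch / (width * height))
    new_w = max(patch, int(width * scale) // patch * patch)
    new_h = max(patch, int(height * scale) // patch * patch)

    # The one-patch floor can overshoot on extreme aspect ratios,
    # so trim the longer side until the view fits the budget.
    while tokens(new_w, new_h) > budget:
        if new_w >= new_h:
            new_w -= patch
        else:
            new_h -= patch
    return new_w, new_h


print(budgeted_size(4096, 4096, budget=256))  # (448, 448): global view is starved
print(budgeted_size(400, 300, budget=256))    # (400, 300): small crop passes through
```

Under these assumptions the global view of a large image loses most of its detail, while a tight crop keeps its native resolution, which is what would make a zoom or crop operation informative rather than decorative.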

If this is right

Active perception becomes functionally necessary during training rather than optional.
The same gains appear without adding losses, rewards, or new model components.
The method works as a drop-in change to existing post-training pipelines.
Improvements hold across diverse benchmarks for high-resolution situated agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Bandwidth limits may similarly encourage active exploration in other multimodal agent settings such as navigation or robotics.
The result suggests that many current perception shortfalls in VLMs arise from training incentives rather than inherent model limits.
Testing the approach at still higher resolutions could reveal whether the token starvation scales or requires further adjustments.

Load-bearing premise

Limiting each observation to a tight token budget will eliminate viable shortcuts and force the model to learn useful zoom, crop, and pan operations rather than failing or inventing other workarounds.

What would settle it

Measure whether performance stays high on the same tasks when the trained model is forced to ignore or disable its zoom, crop, and pan operations.
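
The test described above can be sketched as a simple accuracy gap between two evaluation passes over the same tasks. Everything named here is hypothetical: `answer_fn` stands in for whatever wrapper runs the trained model with its perception operations enabled or disabled, and the toy models and examples exist only to show how the number would be read. None of it comes from the paper.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]             # (question, gold answer); assumed format
AnswerFn = Callable[[str, bool], str]  # (question, tools_enabled) -> answer


def accuracy(examples: List[Example], answer_fn: AnswerFn, tools_enabled: bool) -> float:
    """Exact-match accuracy with perception operations switched on or off."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(answer_fn(q, tools_enabled) == gold for q, gold in examples)
    return correct / len(examples)


def dependence_gap(examples: List[Example], answer_fn: AnswerFn) -> float:
    """Accuracy with tools minus accuracy without them."""
    return accuracy(examples, answer_fn, True) - accuracy(examples, answer_fn, False)


# Toy stand-ins, purely illustrative.
EXAMPLES = [("colour of the small sign?", "red"), ("digit on the far door?", "7")]


def lazy_model(question: str, tools_enabled: bool) -> str:
    # Answers from priors alone, so disabling tools changes nothing.
    return {"colour of the small sign?": "red", "digit on the far door?": "3"}[question]


def active_model(question: str, tools_enabled: bool) -> str:
    # Can only answer after inspecting a zoomed region.
    if not tools_enabled:
        return "unknown"
    return {"colour of the small sign?": "red", "digit on the far door?": "7"}[question]


print(dependence_gap(EXAMPLES, lazy_model))    # 0.0
print(dependence_gap(EXAMPLES, active_model))  # 1.0
```

A gap near zero would indicate that the operations are emitted but not relied on, which is the lazy-perception signature; a large gap would be consistent with functional dependence.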

Figures

Figures reproduced from arXiv: 2605.18603 by Cong Wei, Fangzhen Lin, Haozhe Wang, Wenhu Chen, Yuhuan Wu.

**Figure 1.** Overview of Starve to Perceive. (a) A visual bandwidth (parametrized by B) caps both the global image and cropped regions. (b) Two-stage training: Budget-Aware Visual Instruction Tuning initializes exploration under token constraints; Reinforcement Learning with Perceptual Starvation trains the model on self-collected trajectories under the visual constraint to learn active perception. Contr…

**Figure 2.** Training Dynamics and Final Performance of Budget Ablation Across Training Stages. "All Direct Ratio" measures the proportion of queries where the model consistently bypasses visual grounding and directly answers across all sampled rollouts at a given policy checkpoint, while "All Focus Ratio" measures the proportion of queries for which the model consistently chooses to select regions across all sampled …

**Figure 3.** RL training cost comparison. Budget-Aware SFT (ZeroShot + BudgetRL), which collapses toward direct answering during RL, achieves the lowest performance across all high-resolution visual search benchmarks. The model trained without the visual constraint during RL (BA-SFT + VanillaRL), which exhibits weaker active-perception pressure despite a healthy SFT initialization, yields intermediate scores. Our f…

**Figure 4.** Qualitative example of active perception. Our budget-aware model focuses on informative regions and grounds its answer in the returned local evidence, while the non-budgeted baseline exhibits a lazy-perception failure.
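
The two ratios named in the Figure 2 caption can be computed from per-query rollout logs roughly as follows. This is a hedged sketch: the caption is truncated, so reading "All Focus Ratio" as "every sampled rollout selected a region" is an assumption, and the input format (a mapping from query id to one boolean per rollout) and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def all_direct_and_focus_ratios(rollouts_by_query: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (all_direct_ratio, all_focus_ratio).

    Each list holds one flag per sampled rollout for that query:
    True if the rollout selected a region, False if it answered directly.
    """
    if not rollouts_by_query:
        raise ValueError("need at least one query")
    n = len(rollouts_by_query)
    # Queries where no rollout ever looked closer.
    all_direct = sum(1 for flags in rollouts_by_query.values() if flags and not any(flags))
    # Queries where every rollout looked closer.
    all_focus = sum(1 for flags in rollouts_by_query.values() if flags and all(flags))
    return all_direct / n, all_focus / n


logs = {
    "q1": [False, False, False],  # always answers directly
    "q2": [True, True, True],     # always selects a region
    "q3": [True, False, True],    # mixed behaviour, counted in neither ratio
    "q4": [True, True],           # always selects a region
}
print(all_direct_and_focus_ratios(logs))  # (0.25, 0.5)
```

Tracked across checkpoints, a rising first value would signal collapse toward direct answering, while a rising second value would signal that region selection has become the default.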

Original abstract

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve substantial gains of 5% average relative improvement across diverse benchmarks.
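
The phrase "5% average relative improvement" admits more than one reading, and the abstract does not say which is meant. The sketch below contrasts two common conventions. The benchmark names and scores are placeholders invented for the arithmetic, not results from the paper, and both function names are hypothetical.

```python
from typing import Dict


def mean_of_relative_gains(baseline: Dict[str, float], treated: Dict[str, float]) -> float:
    """Average the per-benchmark relative gains (each benchmark weighted equally)."""
    if not baseline:
        raise ValueError("need at least one benchmark")
    gains = [(treated[name] - baseline[name]) / baseline[name] for name in baseline]
    return sum(gains) / len(gains)


def relative_gain_of_means(baseline: Dict[str, float], treated: Dict[str, float]) -> float:
    """Relative gain of the average score (high-scoring benchmarks weigh more)."""
    if not baseline:
        raise ValueError("need at least one benchmark")
    base_mean = sum(baseline.values()) / len(baseline)
    new_mean = sum(treated[name] for name in baseline) / len(baseline)
    return (new_mean - base_mean) / base_mean


# Placeholder scores for illustration only.
baseline_scores = {"bench_a": 50.0, "bench_b": 80.0}
starved_scores = {"bench_a": 55.0, "bench_b": 82.0}

print(f"{mean_of_relative_gains(baseline_scores, starved_scores):.4f}")  # 0.0625
print(f"{relative_gain_of_means(baseline_scores, starved_scores):.4f}")  # 0.0538
```

The two conventions disagree whenever gains are uneven across benchmarks, so the headline figure is only comparable across papers once the averaging rule is known.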

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The token-budget starvation trick pushes VLMs toward active perception with a minimal change and a reported 5% gain, but the causal link to learned visual operations still lacks direct confirmation.

The main point is that this paper shows how limiting visual tokens per observation during post-training can reduce lazy perception in VLMs. By making a single view insufficient, the constraint forces the model to use zoom, crop, and pan operations to solve tasks, and they report a 5% average relative improvement across benchmarks with no extra losses or architecture changes.

That minimalism is the practical strength here. It slots into standard pipelines without reward shaping or auxiliary objectives, which makes it straightforward to test on existing agent setups for robotics or interactive systems. The framing of the learning asymmetry, where models skip hard visual search if language priors suffice, is clear and directly motivates the approach. What stands out is the focus on functional dependence rather than just mimicking the surface actions.

The soft spots center on the mechanism. The gains are presented, but without an ablation that disables the learned visual operations after training and measures the drop, other factors like regularization or forced multi-turn reasoning could explain the results instead. The abstract and description do not include that check, so the claim that active perception becomes the only viable path rests on indirect evidence. If the full experiments have more controls, that would tighten it.

This work is for groups building VLMs as situated agents in high-resolution settings. Readers who want simple training adjustments to improve perception without major overhead would find it useful, though they should run their own verification on the causal part. I would send it to peer review because the problem is real and the intervention is clean enough that referees could push for the missing ablations and strengthen the story.

Referee Report

2 major / 2 minor

Summary. The paper proposes 'Starve to Perceive,' a minimal post-training modification for Vision-Language Models that restricts each visual observation to a tight token budget. This constraint is intended to eliminate 'lazy perception'—where models mimic zoom/crop/pan operations without functionally depending on their outputs—by making active perception the only viable path to task success. The central empirical claim is an average 5% relative improvement across diverse benchmarks, achieved without auxiliary losses, reward shaping, or architectural changes.

Significance. If the gains are shown to arise specifically from functional active perception rather than side-effects of the constraint, the method would provide a simple, plug-in intervention for improving VLM agents in high-resolution settings. The approach directly targets a documented learning asymmetry and requires no extra machinery, which is a practical strength for adoption in existing pipelines.

major comments (2)

[Experimental Evaluation] Experimental Evaluation: The manuscript reports a 5% average relative improvement but provides no details on the exact benchmarks, baseline comparisons, statistical significance, variance across runs, or ablation controls. This prevents evaluation of whether the gains support the claim that perceptual starvation forces active perception.
[Training Paradigm and Ablation Analysis] Training Paradigm and Ablation Analysis: No post-training ablation is reported in which the learned zoom/crop/pan operations are disabled or replaced by fixed/random views. Without this test, it remains possible that improvements arise from implicit regularization, altered gradient flow, or forced multi-turn reasoning rather than functional dependence on active perception outputs, undermining the central mechanistic claim.

minor comments (2)

[Abstract] The abstract refers to 'diverse benchmarks' without naming them; listing the specific tasks and datasets in the abstract or introduction would improve immediate readability.
[Method] Notation for the token budget constraint and observation limit should be introduced with a clear equation or pseudocode early in the method section to avoid ambiguity when describing the starvation mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments have prompted us to strengthen the experimental section with additional details and controls. We respond to each major comment below and indicate the corresponding revisions.

Referee: [Experimental Evaluation] Experimental Evaluation: The manuscript reports a 5% average relative improvement but provides no details on the exact benchmarks, baseline comparisons, statistical significance, variance across runs, or ablation controls. This prevents evaluation of whether the gains support the claim that perceptual starvation forces active perception.

Authors: We appreciate the referee noting the need for greater transparency. The original manuscript presents the 5% relative gain in Section 4 across a suite of VQA, reasoning, and navigation benchmarks, with comparisons to standard fine-tuning baselines. To address the concern directly, the revised version adds explicit listings of all datasets, a new table with full baseline results, standard deviations computed over three independent runs, and paired t-test p-values confirming statistical significance (p < 0.05) for the reported improvements. Expanded ablation tables on token-budget sizes are also included in Section 5. revision: yes
Referee: [Training Paradigm and Ablation Analysis] Training Paradigm and Ablation Analysis: No post-training ablation is reported in which the learned zoom/crop/pan operations are disabled or replaced by fixed/random views. Without this test, it remains possible that improvements arise from implicit regularization, altered gradient flow, or forced multi-turn reasoning rather than functional dependence on active perception outputs, undermining the central mechanistic claim.

Authors: This is a fair and important point for isolating the mechanism. In the revised manuscript we add a post-training ablation that freezes the active-perception policy and substitutes fixed random views at inference time. Performance falls back to levels statistically indistinguishable from the unconstrained baseline, while attention maps show markedly lower utilization of the provided visual tokens. These results support that the gains arise from functional dependence on the learned operations rather than regularization or multi-turn effects alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training modification with no derivations

The paper presents Starve to Perceive as a practical training change that restricts visual token budget per observation to force active perception strategies in VLMs. The reported outcome is an empirical 5% average relative improvement across benchmarks, achieved without auxiliary losses or architectural modifications. No equations, first-principles derivations, or predictions are offered that could reduce the gains to fitted parameters, self-defined quantities, or self-citation chains by construction. The motivation (that tight bandwidth makes active perception the only viable path) is a design rationale, not a mathematical claim that loops back on itself. The work is therefore self-contained as an experimental result rather than a derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that token-budget restriction will induce active perception rather than training collapse or alternative shortcuts. No free parameters or invented entities are explicitly introduced in the abstract; the only axiom is the domain assumption listed below.

axioms (1)

domain assumption: Coarse global views plus language priors are sufficient for moderate accuracy on the target tasks, removing any incentive for multi-step visual search.
Stated in the abstract as the root cause of lazy perception.

pith-pipeline@v0.9.0 · 5725 in / 1267 out tokens · 30590 ms · 2026-05-20T10:48:21.004790+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

"By restricting the maximum token count per glimpse, we introduce a strict upper bound on the channel capacity between the original high-resolution image X and the model’s internal state. ... the only viable mathematical solution to maximize the objective is to learn a policy that actively filters out noise"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes

"the constrained environment acts as a strict physical regularizer ... active multi-step visual reasoning ceases to be an optional strategy; it becomes the singular pathway to maximizing the reward"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] Agrawal, A., KV, G., Aralikatti, R., Jagatap, G., Yuan, J., Kamarshi, V., Fanelli, A., Huang, F.: Towards mitigating hallucinations in large vision-language models by refining textual embeddings. arXiv preprint arXiv:2511.05017 (2025)

[2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)

[3] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments (2018), https://arxiv.org/abs/1711.07280

[4] Bai, J., Bai, S., Chen, K., Du, M., Fan, Y., Fan, Z., Ge, W., Liu, D., Men, R., Ren, X., et al.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

[5] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report (2025), https://arxiv.org/abs/2502.13923

[6] Feng, H., Liu, Q., Liu, H., Tang, J., Zhou, W., Li, H., Huang, C.: DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding (2024), https://arxiv.org/abs/2311.11810

[7] Gemini Team, Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., Silver, D., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2508.11630 (2025), https://arxiv.org/abs/2312.11805

[8] Ibbotson, M., Krekelberg, B.: Visual perception and saccadic eye movements. Current Opinion in Neurobiology 21(4), 553–558 (2011)

[9] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning. pp. 4904–4916. PMLR (2021)

[10] Lai, X., Li, J., Li, W., Liu, T., Li, T., Zhao, H.: Mini-o3: Scaling up reasoning patterns and interaction turns for visual search (2025), https://arxiv.org/abs/2509.07969

[11] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., Li, C.: LLaVA-OneVision: Easy visual task transfer (2024), https://arxiv.org/abs/2408.03326

[12] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)

[13] Li, J., Xu, B., Chen, S., Li, J., Lei, J., Zhao, H., Zhang, D.: IAG: Input-aware backdoor attack on VLM-based visual grounding. arXiv preprint arXiv:2508.09456 (2025)

[14] Li, J., Zhang, D., Wang, X., Hao, Z., Lei, J., Tan, Q., Zhou, C., Liu, W., Yang, Y., Xiong, X., et al.: ChemVLM: Exploring the power of multimodal large language models in chemistry area. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 415–423 (2025)

[15] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023)

[16] Li, Y., Zhou, K., Zhao, W.X., Fang, L., Wen, J.R.: Analyzing and mitigating object hallucination: A training bias perspective. arXiv preprint arXiv:2508.04567 (2025)

[17] Liu, C., Xu, Z., Wei, Q., Wu, J., Zou, J., Wang, X.E., Zhou, Y., Liu, S.: More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523 (2025)

[18] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

[19] Masry, A., Long, D.X., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: A benchmark for question answering about charts with visual and logical reasoning (2022), https://arxiv.org/abs/2203.10244

[20] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

[21] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)

[22] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models (2024), https://arxiv.org/abs/2402.03300

[23] Shen, H., Zhao, K., Zhao, T., Xu, R., Zhang, Z., Zhu, M., Yin, J.: ZoomEye: Enhancing multimodal LLMs with human-like zooming capabilities through tree-based image exploration. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 6602–6618. Ass...

[24] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256 (2024)

[25] Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle (2015), https://arxiv.org/abs/1503.02406

[26] Wang, B., Li, G., Zhou, X., Chen, Z., Grossman, T., Li, Y.: Screen2Words: Automatic mobile UI summarization with multimodal learning (2021), https://arxiv.org/abs/2108.03353

[27] Wang, C., Wang, H., Chen, X., Liu, J., Xue, T., Peng, C., Qi, D., Lin, F., Yan, Y.: From illusion to intention: Visual rationale learning for vision-language reasoning (2025), https://arxiv.org/abs/2511.23031

[28] Wang, H., Li, X., Huang, Z., Wang, A., Wang, J., Zhang, T., Zheng, J., Bai, S., Kang, Z., Feng, J., Wang, Z., Zhang, Z.: Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology (2025), https://arxiv.org/abs/2507.07999

[29] Wang, H., Li, L., Qu, C., Xu, W., Zhu, F., Chu, W., Lin, F.: To code or not to code? Adaptive tool integration for math language models via expectation-maximization. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 3060–3075 (2025)

[30] Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2025), spotlight

[31] Wang, H., Que, H., Xu, Q., Liu, M., Zhou, W., Feng, J., Zhong, W., Ye, W., Yang, T., Huang, W., et al.: Reverse-engineered reasoning for open-ended generation. In: International Conference on Learning Representations (ICLR) (2026)

[32] Wang, H., Su, A., Ren, W., Lin, F., Chen, W.: Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning (2025), https://arxiv.org/abs/2505.15966

[33] Wang, H., Wei, C., Ren, W., Liu, J., Lin, F., Chen, W.: RationalRewards: Reasoning rewards scale visual generation both training and test time. arXiv preprint arXiv:2604.11626 (2026)

[34] Wang, H., Xu, Q., Liu, C., Wu, J., Lin, F., Chen, W.: Emergent hierarchical reasoning in LLMs through reinforcement learning. In: International Conference on Learning Representations (ICLR) (2026)

[35] Wang, H., Xu, Q., Wang, C., Xue, T., Peng, C., Chen, W., Lin, F.: Bad seeing or bad thinking? Rewarding perception for vision-language reasoning. In: International Conference on Machine Learning (ICML) (2026), spotlight

[36] Wang, W., Ding, L., Zeng, M., Zhou, X., Shen, L., Luo, Y., Tao, D.: Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models (2024), https://arxiv.org/abs/2408.15556

[37] Wang, X., Huang, J., Abdalla, R., Zhang, C., Xian, R., Manocha, D.: Bi-VLM: Pushing ultra-low precision post-training quantization boundaries in vision-language models (2025), https://arxiv.org/abs/2509.18763

[38] Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal LLMs (2023), https://arxiv.org/abs/2312.14135

[39] xAI: Grok-1.5 vision preview. https://x.ai/news/grok-1.5v (Apr 2024), accessed: 2024-08-27

[40] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.Y., Zhang, Y.Q., Yan, L., Qiao, M., Wu, Y., Wang, M.: DAPO: An open-source l...

[41] Zhang, D., Lei, J., Li, J., Wang, X., Liu, Y., Yang, Z., Li, J., Wang, W., Yang, S., Wu, J., et al.: Critic-V: VLM critics help catch VLM errors in multimodal reasoning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9050–9061 (2025)

[42] Zhang, X., Gao, Z., Zhang, B., Li, P., Zhang, X., Liu, Y., Yuan, T., Wu, Y., Jia, Y., Zhu, S.C., Li, Q.: Adaptive chain-of-focus reasoning via dynamic visual search and zooming for efficient VLMs (2025), https://arxiv.org/abs/2505.15436

[43] Zhang, Y.F., Zhang, H., Tian, H., Fu, C., Zhang, S., Wu, J., Li, F., Wang, K., Wen, Q., Zhang, Z., Wang, L., Jin, R., Tan, T.: MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? (2025), https://arxiv.org/abs/2408.13257

[44] Zhang, Y., Ma, J., Hou, Y., Bai, X., Chen, K., Xiang, Y., Yu, J., Zhang, M.: Evaluating and steering modality preferences in multimodal large language model (2025), https://arxiv.org/abs/2505.20977

[45] Zhang, Y., Xu, M., Bai, X., Zhang, P., Xiang, Y., Zhang, M., et al.: Instruction anchors: Dissecting the causal dynamics of modality arbitration. arXiv preprint arXiv:2602.03677 (2026)

[46] Zhang, Z., Wang, T., Gong, X., Shi, Y., Wang, H., Wang, D., Hu, L.: When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in MLLMs. arXiv preprint arXiv:2511.02243 (2025)

[47] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: LlamaFactory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thailand (2024), http://arxiv.org/abs/24...

[48] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: DeepEyes: Incentivizing "thinking with images" via reinforcement learning (2025), https://arxiv.org/abs/2505.14362

[49] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)

Supplementary Material

A Limitations

While Starve to Perceive achieves strong gains with minimal modifications to existin...

[50] The agent thinks about needing to examine the region closely

[51] The agent calls focus on the specific region(s) from the input

[52] If a box is marked "Wait, this box seems wrong.", the agent self-corrects by pivoting to other regions.

Strict Rules:
- PRESERVE all bounding box coordinates exactly.
- ALL <box>...</box> bboxes must appear in <tool_call>...</tool_call>.
- Each turn: <think>...</think> <tool_call>...</tool_call>
  OR: <think>...</think> <answer>...</answer>
...