pith. sign in

arxiv: 2606.24292 · v1 · pith:NPFPTM3Wnew · submitted 2026-06-23 · 💻 cs.CV

ActiveScope: Actively Seeking and Correcting Perception for MLLMs

Pith reviewed 2026-06-26 01:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language modelsfine-grained perceptiontraining-free methodsactive localizationself-correctionhigh-resolution imagessemantic anchorsattention suppression
0
0 comments X

The pith

ActiveScope lets MLLMs actively locate multiple targets and correct for distractions in high-resolution images without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces failures in existing training-free MLLM methods to contextual dominance, where distractors overwhelm attention, and semantic bias, where models fixate on the most salient concept and miss multiple targets. It builds a framework called ActiveScope that counters these issues with two modules: one that anchors localization to fine-grained semantics for each target independently, and one that refines the result by suppressing attention on interfering elements. If the approach holds, MLLMs gain reliable fine-grained perception on detailed scenes through inference-time search and correction rather than retraining.

Core claim

ActiveScope is a training-free framework that enhances MLLMs by actively seeking and correcting perception. Its Semantic Anchor Localization module uses fine-grained semantic anchors to independently localize key targets and thereby mitigate semantic bias. Its Interference-Suppressed Refinement module suppresses attention on salient distractions to overcome contextual dominance. Experiments on high-resolution image understanding benchmarks show it outperforms prior training-free methods, reaching 96.34 percent accuracy on V* Bench.

What carries the argument

Semantic Anchor Localization (SAL) and Interference-Suppressed Refinement (ISR) modules that together enable active target search and self-correction of perception.

If this is right

  • MLLMs can localize multiple targets independently instead of defaulting to the most salient object.
  • Attention to distracting elements is reduced during refinement, improving localization accuracy.
  • Fine-grained perception improves on high-resolution benchmarks without any model retraining.
  • The active-search-plus-self-correction pattern becomes a viable alternative to attention-based or coarse-to-fine methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same correction steps could be inserted into pipelines for other vision-language tasks that suffer from attention capture by salient but irrelevant content.
  • If the modules prove modular, they might be combined with existing localization tools to create hybrid inference strategies.
  • The results imply that perception errors in large models are often addressable at inference time rather than requiring architectural changes.

Load-bearing premise

That the main failures of prior methods arise from contextual dominance and semantic bias, and that the two new modules fix those problems without creating comparable new failure modes.

What would settle it

Running the same benchmarks and finding that ActiveScope accuracy does not exceed existing training-free baselines or that localization still misses multiple targets in high-resolution scenes.

Figures

Figures reproduced from arXiv: 2606.24292 by Chao Bi, Junshu Sun, Qingming Huang, Shufan Shen, Shuhui Wang, Yajing Wang, Zhaobo Qi.

Figure 1
Figure 1. Figure 1: (a) Schematic of the previous training-free method, which struggles to locate hard targets and adapt to multi-target scenarios. (b) ActiveScope supports the localization of multiple targets and can refine the initial positioning to the correct regions. (c) ActiveScope outperforms other methods on multiple benchmarks. areas to redistribute attention, thereby allowing the model to correct its focus. In terms… view at source ↗
Figure 3
Figure 3. Figure 3: (a) iterative localization: localize with Interference Sup￾pression strategy for two steps. (b) key token guidance: localize using attention maps from the key tokens corresponding to targets [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The framework of ActiveScope. SAL generates target-specific attention maps using semantic anchors, and ISR refines localization by suppressing interfering regions. The refined regions are assembled into a focused visual prompt for final prediction. region, guiding the model to shift its attentional focus to other potential areas for re-localization. This process repeats for a maximum of C cycles. In practi… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of localization results between ActiveScope and ZoomEye. Green and red boxes represent successful and failed localizations, respectively [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of iterative localization and self-correction across multiple target objects. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of iterative localization and self-correction across multiple target objects. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive vision-language understanding, yet still struggle with fine-grained perception in high-resolution images. While existing training-free methods typically rely on attention-based localization or coarse-to-fine search, they are often misled by distractors and fail to locate multiple targets. Our investigation attributes these failures to Contextual Dominance, where salient distractors overwhelm target attention and cause inaccurate localization, and Semantic Bias, where global semantics cause the model to fixate on the most salient concept, resulting in incomplete localization in multi-object scenarios. Built on these insights, we propose ActiveScope, a training-free framework that enhances MLLMs by actively seeking and correcting perception. ActiveScope features two modules. The Semantic Anchor Localization (SAL) utilizes fine-grained semantic anchors to independently localize key targets, thereby mitigating semantic bias. The Interference-Suppressed Refinement (ISR) refines localization by suppressing attention on salient distractions to overcome contextual dominance. Extensive experiments on high-resolution image understanding benchmarks demonstrate that ActiveScope outperforms existing training-free methods (e.g., 96.34 percent accuracy on $V^{*}$ Bench), validating the superiority of the active search and self-correction paradigm. Our code is available at https://github.com/jasmine-ww/ActiveScope.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ActiveScope, a training-free framework for improving MLLMs' fine-grained perception on high-resolution images. It attributes failures in prior methods to Contextual Dominance (salient distractors overwhelming attention) and Semantic Bias (fixation on the most salient concept in multi-object scenes), and introduces two modules: Semantic Anchor Localization (SAL) to independently localize targets via fine-grained semantic anchors, and Interference-Suppressed Refinement (ISR) to suppress attention on distractions. The central claim is that these modules enable active search and self-correction, yielding superior results such as 96.34% accuracy on V* Bench over existing training-free baselines.

Significance. If the experimental results and attribution hold under proper controls, the work would be significant for demonstrating a training-free active-perception paradigm that could improve MLLM reliability on detailed visual tasks without retraining. The public code release supports reproducibility, which is a strength.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts performance gains (e.g., 96.34% on V* Bench) and attributes them to the SAL and ISR modules, yet supplies no experimental protocol, baselines, error bars, ablation details, or statistical tests; verification of the central claim is impossible from the given text.
  2. [Abstract] Abstract: The claim that prior-method failures stem from Contextual Dominance and Semantic Bias, and that SAL/ISR directly mitigate them without new failure modes, is presented as an axiom without supporting analysis, failure-case studies, or controls that would make the attribution load-bearing.
minor comments (1)
  1. [Abstract] The abstract mentions 'extensive experiments on high-resolution image understanding benchmarks' but provides no list of datasets, metrics, or comparison methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript. We address the major comments on the abstract below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts performance gains (e.g., 96.34% on V* Bench) and attributes them to the SAL and ISR modules, yet supplies no experimental protocol, baselines, error bars, ablation details, or statistical tests; verification of the central claim is impossible from the given text.

    Authors: Abstracts by convention provide a concise overview without full methodological details or statistics. The experimental protocol, baselines, ablations, and results are detailed in Sections 4 and 5 of the manuscript, with code released for reproducibility. The 96.34% figure is from the V* Bench evaluation reported there. revision: no

  2. Referee: [Abstract] Abstract: The claim that prior-method failures stem from Contextual Dominance and Semantic Bias, and that SAL/ISR directly mitigate them without new failure modes, is presented as an axiom without supporting analysis, failure-case studies, or controls that would make the attribution load-bearing.

    Authors: The abstract summarizes our investigation into these failure modes, which is elaborated with qualitative examples, quantitative analysis, and ablation studies in the main text (Introduction and Section 3). These demonstrate how SAL and ISR address the identified issues without introducing new failure modes, as shown by comparative results. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical training-free framework whose core claims rest on benchmark performance numbers (e.g., 96.34% on V* Bench) rather than any derivation, equation, or fitted parameter. The attribution of failures to Contextual Dominance and Semantic Bias is presented as an investigative finding that motivates the design of SAL and ISR; those modules are not shown to be defined in terms of the performance metric they are later evaluated on. No self-citations, uniqueness theorems, or ansatzes appear in the supplied text. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no information on free parameters, background axioms, or newly postulated entities; all arrays left empty.

pith-pipeline@v0.9.1-grok · 5778 in / 1059 out tokens · 29778 ms · 2026-06-26T01:04:28.626816+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 9 linked inside Pith

  1. [1]

    L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Qwen technical report.arXiv preprint arXiv:2309.16609, 2023a

    Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023a. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A versatile vision- language model for understanding, localization, text reading, and beyond.a...

  3. [3]

    Expanding per- formance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024a

    Chen, Z., Wang, W., Cao, Y ., Liu, Y ., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding per- formance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024a. Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. Intern...

  4. [4]

    Vision transformers need registers

    Darcet, T., Oquab, M., Mairal, J., and Bojanowski, P. Vision transformers need registers. InInternational Conference on Learning Representations, volume 2024, pp. 2632– 2652,

  5. [5]

    Convllava: Hierar- chical backbones as visual encoder for large multimodal models.arXiv preprint arXiv:2405.15738,

    Ge, C., Cheng, S., Wang, Z., Yuan, J., Gao, Y ., Song, J., Song, S., Huang, G., and Zheng, B. Convllava: Hierar- chical backbones as visual encoder for large multimodal models.arXiv preprint arXiv:2405.15738,

  6. [6]

    Hide: Rethinking the zoom-in method in high resolu- tion mllms via hierarchical decoupling.arXiv preprint arXiv:2510.00054,

    Liu, X., Hu, Y ., Zou, Y ., Wu, L., Xu, J., and Zheng, B. Hide: Rethinking the zoom-in method in high resolu- tion mllms via hierarchical decoupling.arXiv preprint arXiv:2510.00054,

  7. [7]

    Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966,

    Liu, Z., Dong, Y ., Rao, Y ., Zhou, J., and Lu, J. Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966,

  8. [8]

    Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models

    Luo, G., Zhou, Y ., Zhang, Y ., Zheng, X., Sun, X., and Ji, R. Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. InInternational Conference on Learning Representations, volume 2025, pp. 84491–84506,

  9. [9]

    Through the magnifying glass: Adaptive perception magnification for hallucination-free vlm decoding.arXiv preprint arXiv:2503.10183,

    Mao, S., Zhang, C., and Cai, W. Through the magnifying glass: Adaptive perception magnification for hallucination-free vlm decoding.arXiv preprint arXiv:2503.10183,

  10. [10]

    Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based im- age exploration

    Shen, H., Zhao, K., Zhao, T., Xu, R., Zhang, Z., Zhu, M., and Yin, J. Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based im- age exploration. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 6613–6629,

  11. [11]

    Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971,

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971,

  12. [12]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

  13. [13]

    Qwen3 technical report.arXiv preprint arXiv:2505.09388,

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  14. [14]

    and Zhang, K

    Yang, F. and Zhang, K. Mrd: Multi-resolution retrieval- detection fusion for high-resolution image understanding. arXiv preprint arXiv:2512.02906,

  15. [15]

    J., and Ma, Y

    Zhai, Y ., Tong, S., Li, X., Cai, M., Qu, Q., Lee, Y . J., and Ma, Y . Investigating the catastrophic forgetting in multimodal large language models.arXiv preprint arXiv:2309.10313,

  16. [16]

    Mllms know where to look: Training-free perception of small visual details with multimodal llms

    Zhang, J., Khayatkhoei, M., Chhikara, P., and Ilievski, F. Mllms know where to look: Training-free perception of small visual details with multimodal llms. InInterna- tional Conference on Learning Representations, volume 2025, pp. 68194–68213, 2025a. Zhang, Y ., Zhang, H., Tian, H., Fu, C., Zhang, S., Wu, J., Li, F., Wang, K., Wen, Q., Zhang, Z., et al. M...

  17. [17]

    Look where it matters: Training-free ultra-hr remote sensing vqa via adaptive zoom search.arXiv preprint arXiv:2511.20460,

    Zhou, Y ., Jiang, C., Yuan, C., and Li, J. Look where it matters: Training-free ultra-hr remote sensing vqa via adaptive zoom search.arXiv preprint arXiv:2511.20460,

  18. [18]

    Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y ., Su, W., Shao, J., et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

  19. [19]

    contextual dominance

    11 ActiveScope: Actively Seeking and Correcting Perception for MLLMs A. Algorithm of ActiveScope The pseudocode for the ActiveScope method is provided below. Algorithm 1Complete Algorithm Workflow of ActiveScope Require:ImageI, QueryQ, MLLM maskM, Iteration LimitC Ensure:Final responseY response 1:S← M.generate(I, p extract) 2:S bbox ← ∅ 3:Initialize Atte...