pith. machine review for the scientific record.

arxiv: 2512.10362 · v2 · submitted 2025-12-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords contextual blindness · visual funnel · multimodal LLMs · image cropping · attention entropy · hierarchical context · training-free method · structural diversity

The pith

Visual Funnel resolves contextual blindness in MLLMs by building an entropy-scaled hierarchical crop portfolio that preserves structural diversity across scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often fail at fine-grained visual tasks because cropping salient regions creates a structural disconnect between high-fidelity details and the broader image context. The paper names this failure Contextual Blindness and traces it to missing structural diversity in the input rather than insufficient information quantity. Visual Funnel counters it with a training-free two-step process: Contextual Anchoring locates the region of interest in one pass, then an Entropy-Scaled Portfolio dynamically sizes and centers additional crops according to attention entropy to maintain a hierarchy from focal detail outward. Experiments show the method beats both single-crop and unstructured multi-crop baselines, confirming that organized hierarchy, not crop count, drives the gain. This matters for any precision task where models must integrate local evidence with global scene structure without retraining.

Core claim

We identify Contextual Blindness as the structural disconnect between high-fidelity details from crops and global context from the original image. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves hierarchical context ranging from focal detail to broader surroundings by dynamically determining crop sizes based on attention entropy and refining crop centers. Extensive experiments show this significantly outperforms naive single-crop and unstructured multi-crop baselines, confirming that the hierarchical structure is key to resolving Contextual Blindness.

What carries the argument

Entropy-Scaled Portfolio: a collection of dynamically sized crops whose dimensions are set by attention entropy to enforce a hierarchy that links focal high-resolution detail to surrounding lower-resolution context without training.
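
The exact sizing rule is not given in the material above, so the following is a minimal sketch of one plausible reading: a normalized attention-entropy score stretches a base crop size, and a small set of scale multipliers yields the focal-to-broader hierarchy around the anchor. The function names, the `base_frac` default, and the `levels` multipliers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of a non-negative patch-level attention map."""
    p = attn.flatten()
    p = p / p.sum()
    p = p[p > 0]                                            # avoid log(0)
    return float(-(p * np.log(p)).sum())

def entropy_scaled_portfolio(image_hw, anchor_xy, attn,
                             base_frac=0.15, levels=(1.0, 2.0, 4.0)):
    """Return (x0, y0, x1, y1) crops centered on the anchor, focal to broader.

    Diffuse attention (high entropy) yields larger crops; peaked attention
    yields tighter ones. Scale factors here are illustrative only.
    """
    H, W = image_hw
    cx, cy = anchor_xy
    h_norm = attention_entropy(attn) / np.log(attn.size)    # entropy in [0, 1]
    base = base_frac * (1.0 + h_norm) * min(H, W)           # entropy-scaled side
    crops = []
    for scale in levels:                                     # hierarchy of scales
        side = min(scale * base, min(H, W))
        half = side / 2
        x0, y0 = max(0.0, cx - half), max(0.0, cy - half)
        x1, y1 = min(float(W), cx + half), min(float(H), cy + half)
        crops.append((int(x0), int(y0), int(x1), int(y1)))
    return crops
```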

Load-bearing premise

Attention entropy can be trusted to select crop sizes and centers that restore structural diversity without creating fresh disconnects or overlooking essential information outside the anchored region.

What would settle it

An experiment in which random or fixed-size multi-crop sets match or exceed the entropy-scaled portfolio on tasks that require both fine detail and global context would falsify the claim that the hierarchical entropy mechanism is necessary.
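
For concreteness, the two control conditions such a test needs could be built roughly as below; `random_crops` (unanchored, unstructured) and `fixed_crops` (anchored but entropy-independent) are hypothetical baseline constructors sketched here, not implementations from the paper.

```python
import random

def random_crops(image_hw, n=3, frac_range=(0.1, 0.6), seed=0):
    """Unstructured baseline: n crops with random sizes and random centers."""
    H, W = image_hw
    rng = random.Random(seed)
    crops = []
    for _ in range(n):
        side = rng.uniform(*frac_range) * min(H, W)
        cx = rng.uniform(side / 2, W - side / 2)
        cy = rng.uniform(side / 2, H - side / 2)
        crops.append((int(cx - side / 2), int(cy - side / 2),
                      int(cx + side / 2), int(cy + side / 2)))
    return crops

def fixed_crops(image_hw, anchor_xy, sides_frac=(0.2, 0.4, 0.6)):
    """Anchored baseline: fixed multi-scale sizes, no entropy scaling."""
    H, W = image_hw
    cx, cy = anchor_xy
    crops = []
    for f in sides_frac:
        half = f * min(H, W) / 2
        crops.append((int(max(0, cx - half)), int(max(0, cy - half)),
                      int(min(W, cx + half)), int(min(H, cy + half))))
    return crops
```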

Figures

Figures reproduced from arXiv: 2512.10362 by Jaehoon Go, Junyeong Kim, Mingyu Jeon, Sunjae Yoon, Woojun Jung.

Figure 1
Figure 1: Illustration of Contextual Blindness and our proposed solution, Visual Funnel. (a) Prior single-crop methods successfully localize the area of interest but perform a naive integration by providing only a tight crop. This isolates the detail from its necessary context, leading to an incorrect answer (e.g., misidentifying a ‘parking space’ as a ‘lane’). (b) Our Visual Funnel performs a more sophisticated… view at source ↗
Figure 2
Figure 2: Examples of Contextual Blindness. Single-crop methods systematically remove essential context needed for correct reasoning, even when tight crops (red boxes) successfully isolate fine-grained details. Top row: (left) cropping only seated diners leads to overlooking the standing person; (center) excluding background reference objects results in incorrect height judgments; (right) different object positions … view at source ↗
Figure 3
Figure 3: An overview of our proposed Visual Funnel methodology. Visual Funnel resolves Contextual Blindness through a two-step, training-free process. (1) Contextual Anchoring: A localization-focused query guides the MLLM to establish a semantic anchor by generating a precise spatial attention map for the region of interest. (2) Entropy-Scaled Portfolio Generation: This attention map then serves as the foundation f… view at source ↗
Figure 4
Figure 4: Qualitative comparison between the Single-Crop baseline and Visual Funnel. We visualize the inputs and predictions using Qwen2.5-VL-3B-Instruct. The Red box represents the input for the standard Single-Crop baseline (w/ ViCrop), while the Blue, Green, and Purple boxes represent the hierarchical portfolio (Focal, Immediate, Broader context) used in Visual Funnel. (Top two rows) Success Cases: Visual Funnel … view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs suffer from Contextual Blindness due to structural disconnects when using crops for fine-grained details, and proposes Visual Funnel: a training-free method that first performs Contextual Anchoring via attention to identify regions of interest, then builds an Entropy-Scaled Portfolio of hierarchically sized crops (determined dynamically from attention entropy) to restore structural diversity while preserving global context. Experiments are said to show significant outperformance over naive single-crop and unstructured multi-crop baselines, with the hierarchical portfolio structure identified as the key factor.

Significance. If the results hold after clarifying baselines, the work offers a simple, training-free way to improve MLLM visual reasoning on precision tasks by leveraging existing attention maps. The emphasis on structural diversity over mere information quantity is a useful conceptual contribution, and the absence of fitted parameters strengthens potential reproducibility.

major comments (2)
  1. [Experiments] Experiments section: the central claim that 'the hierarchical structure of the entropy-scaled portfolio is key' depends on the unstructured multi-crop baseline. Specify whether this baseline also centers crops on the Contextual Anchoring output or samples randomly across the full image. If the latter, the comparison confounds anchoring benefits with entropy-driven scaling, undermining attribution to the portfolio structure.
  2. [Method] Method section (entropy scaling step): the assumption that attention entropy reliably determines crop sizes and centers without introducing new disconnects lacks supporting analysis. Provide the exact entropy computation formula, the scaling rule for crop sizes, and any ablation on alternative entropy thresholds or center refinement heuristics.
minor comments (2)
  1. [Abstract] Abstract: reports outperformance but omits all quantitative metrics, datasets, error bars, or ablation summaries; add at least the key performance deltas and dataset names for immediate context.
  2. [Method] Notation: define 'attention entropy' and 'portfolio' explicitly on first use, and ensure consistent terminology between Contextual Anchoring and the subsequent crop construction step.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the attribution of our results and strengthen the methodological details. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Experiments section: the central claim that 'the hierarchical structure of the entropy-scaled portfolio is key' depends on the unstructured multi-crop baseline. Specify whether this baseline also centers crops on the Contextual Anchoring output or samples randomly across the full image. If the latter, the comparison confounds anchoring benefits with entropy-driven scaling, undermining attribution to the portfolio structure.

    Authors: We agree this distinction is important for proper attribution. In the current experiments, the unstructured multi-crop baseline samples crops randomly across the full image (without Contextual Anchoring). To isolate the contribution of the hierarchical entropy-scaled structure, we have run additional controls that apply the same anchoring centers but use uniform or random sizes instead of entropy scaling. These results show that the entropy-driven hierarchical sizing still yields measurable gains over anchored but unstructured crops, supporting our claim. We will add these controls, clarify the baseline construction, and update the experiments section and appendix in the revision. revision: yes

  2. Referee: Method section (entropy scaling step): the assumption that attention entropy reliably determines crop sizes and centers without introducing new disconnects lacks supporting analysis. Provide the exact entropy computation formula, the scaling rule for crop sizes, and any ablation on alternative entropy thresholds or center refinement heuristics.

    Authors: We acknowledge that the entropy computation and scaling rule were described at a high level rather than with explicit formulas. The attention entropy is computed as the Shannon entropy H = -∑(p_i * log p_i) over the normalized attention weights p_i of the image patches from the final layer. Crop sizes are then scaled as size = min_size + (max_size - min_size) * (H / H_max), with centers refined by shifting toward the attention-weighted centroid within a small radius. We will insert the precise formulas, pseudocode, and ablation results (varying entropy thresholds and refinement heuristics) into the Method section and supplementary material. revision: yes
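
Since this rebuttal is simulated, the formulas above are not confirmed against the paper; the sketch below simply transcribes them as stated, with `max_shift` standing in for the unspecified "small radius" of the center refinement.

```python
import numpy as np

def crop_from_attention(attn, center_xy, min_size, max_size, max_shift):
    """Entropy-scaled crop size and refined center, per the formulas as stated.

    attn: 2D array of non-negative patch attention weights over the image grid.
    Returns (side_length, (cx, cy)) in the same coordinate units as center_xy.
    """
    p = attn.flatten() / attn.sum()                  # normalized attention weights
    H = float(-(p[p > 0] * np.log(p[p > 0])).sum())  # Shannon entropy
    H_max = np.log(attn.size)                        # entropy of uniform attention
    size = min_size + (max_size - min_size) * (H / H_max)

    # Attention-weighted centroid in (x, y) grid coordinates.
    ys, xs = np.indices(attn.shape)
    w = attn / attn.sum()
    centroid = np.array([(xs * w).sum(), (ys * w).sum()])

    # Shift the anchor toward the centroid, capped at max_shift.
    center = np.asarray(center_xy, dtype=float)
    delta = centroid - center
    dist = np.linalg.norm(delta)
    if dist > max_shift and dist > 0:
        delta *= max_shift / dist
    return float(size), tuple(center + delta)
```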

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a training-free pipeline that directly consumes attention maps already generated by the base MLLM, computes entropy to set crop sizes and centers, and evaluates against explicit baselines. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The central claim (hierarchical entropy scaling outperforms unstructured crops) is supported by experimental comparison rather than by construction from the method's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that attention maps already computed by MLLMs contain sufficient signal for both region anchoring and entropy-based scale selection; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Attention entropy from the base MLLM can be used to determine crop sizes that restore structural diversity.
    Invoked in the Entropy-Scaled Portfolio construction step described in the abstract.

pith-pipeline@v0.9.0 · 5545 in / 1280 out tokens · 66262 ms · 2026-05-16T23:43:47.918934+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 10 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

  2. [2]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. InternVL2: Better than the best—redefining open-source multimodal models with InternVL2.5. arXiv preprint arXiv:2412.05271, 2024.

  3. [3]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

  4. [4]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations, 2021.

  5. [5]

    Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.

  6. [6]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2023.

  7. [7]

    GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

    Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.

  8. [8]

    Words over pixels? rethinking vision in multimodal large language models

    Anubhooti Jain, Mayank Vatsa, and Richa Singh. Words over pixels? rethinking vision in multimodal large language models. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) Survey Track, pages 10481–10489, 2025.

  9. [9]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.

  10. [10]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.

  11. [11]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.

  12. [12]

    Monkey: Image resolution and text label are important things for large multi-modal models

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023.

  13. [13]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

  14. [14]

    LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/, 2024.

  15. [15]

    DocVQA: A Dataset for VQA on Document Images

    Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.

  16. [16]

    InfographicVQA

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.

  17. [17]

    Compositional chain-of-thought prompting for large multimodal models

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  18. [18]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

  19. [19]

    A-OKVQA: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162, 2022.

  20. [20]

    Scaling vision pre-training to 4k resolution

    Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, and Hongxu Yin. Scaling vision pre-training to 4k resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  21. [21]

    Towards VQA models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.

  22. [22]

    ViperGPT: Visual Inference via Python Execution for Reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.

  23. [23]

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akkiraju, Jihan Xiao, Silvio Savarese, Yann LeCun, Bart Oguz, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. arXiv preprint arXiv:2406.16860, 2024.

  24. [24]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  25. [25]

    Vary: Scaling up the vision vocabulary for large vision-language models

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023.

  26. [26]

    Controlmllm: Training-free visual prompt learning for multimodal large language models

    Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, and Rongrong Ji. Controlmllm: Training-free visual prompt learning for multimodal large language models. In Advances in Neural Information Processing Systems, 2024.

  27. [27]

    V*: Guided visual search as a core mechanism in multimodal LLMs

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.

  28. [28]

    LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images

    Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703, 2024.

  29. [29]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.

  30. [30]

    Exploring perceptual limitation of multimodal large language models

    Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, and Maosong Sun. Exploring perceptual limitation of multimodal large language models, 2024.

  31. [31]

    Towards perceiving small visual details in zero-shot visual question answering with multimodal LLMs

    Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Towards perceiving small visual details in zero-shot visual question answering with multimodal LLMs,

  32. [32]

    MLLMs Know Where to Look: Training-Free Perception of Small Visual Details with Multimodal LLMs

    Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. In The Thirteenth International Conference on Learning Representations, 2025.