Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Pith reviewed 2026-05-16 23:43 UTC · model grok-4.3
The pith
Visual Funnel resolves contextual blindness in MLLMs by building an entropy-scaled hierarchical crop portfolio that preserves structural diversity across scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We identify Contextual Blindness as the structural disconnect between high-fidelity details from crops and global context from the original image. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves hierarchical context ranging from focal detail to broader surroundings by dynamically determining crop sizes based on attention entropy and refining crop centers. Extensive experiments show this significantly outperforms naive single-crop and unstructured multi-crop baselines, confirming that the hierarchical structure is key to resolving Contextual Blindness.
What carries the argument
Entropy-Scaled Portfolio: a collection of dynamically sized crops whose dimensions are set by attention entropy to enforce a hierarchy that links focal high-resolution detail to surrounding lower-resolution context without training.
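The portfolio idea can be sketched as nested crops that share a center and grow in side length across scale levels, so each level links fine detail to progressively wider context. The function name, the scale schedule, and the clamping rule below are illustrative assumptions, not the paper's exact construction:

```python
# Hypothetical sketch of a hierarchical crop portfolio: nested boxes around
# one anchor point, one per scale level, clamped to the image bounds.
# The scale factors stand in for the paper's entropy-derived sizes.

def build_crop_portfolio(image_w, image_h, cx, cy, base_size,
                         scales=(1.0, 2.0, 4.0)):
    """Return (left, top, right, bottom) boxes centered on (cx, cy)."""
    portfolio = []
    for s in scales:
        half = int(base_size * s) // 2
        portfolio.append((
            max(0, cx - half),          # clamp to the image on each side
            max(0, cy - half),
            min(image_w, cx + half),
            min(image_h, cy + half),
        ))
    return portfolio
```

By construction each box contains the previous one, which is the "hierarchy" the review says links focal detail to surrounding context; the real method would replace the fixed `scales` tuple with entropy-driven sizes.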
Load-bearing premise
Attention entropy can be trusted to select crop sizes and centers that restore structural diversity without creating fresh disconnects or overlooking essential information outside the anchored region.
What would settle it
An experiment in which random or fixed-size multi-crop sets match or exceed the entropy-scaled portfolio on tasks that require both fine detail and global context would falsify the claim that the hierarchical entropy mechanism is necessary.
Original abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that MLLMs suffer from Contextual Blindness due to structural disconnects when using crops for fine-grained details, and proposes Visual Funnel: a training-free method that first performs Contextual Anchoring via attention to identify regions of interest, then builds an Entropy-Scaled Portfolio of hierarchically sized crops (determined dynamically from attention entropy) to restore structural diversity while preserving global context. Experiments are said to show significant outperformance over naive single-crop and unstructured multi-crop baselines, with the hierarchical portfolio structure identified as the key factor.
Significance. If the results hold after clarifying baselines, the work offers a simple, training-free way to improve MLLM visual reasoning on precision tasks by leveraging existing attention maps. The emphasis on structural diversity over mere information quantity is a useful conceptual contribution, and the absence of fitted parameters strengthens potential reproducibility.
Major comments (2)
- [Experiments] Experiments section: the central claim that 'the hierarchical structure of the entropy-scaled portfolio is key' depends on the unstructured multi-crop baseline. Specify whether this baseline also centers crops on the Contextual Anchoring output or samples randomly across the full image. If the latter, the comparison confounds anchoring benefits with entropy-driven scaling, undermining attribution to the portfolio structure.
- [Method] Method section (entropy scaling step): the assumption that attention entropy reliably determines crop sizes and centers without introducing new disconnects lacks supporting analysis. Provide the exact entropy computation formula, the scaling rule for crop sizes, and any ablation on alternative entropy thresholds or center refinement heuristics.
Minor comments (2)
- [Abstract] Abstract: reports outperformance but omits all quantitative metrics, datasets, error bars, or ablation summaries; add at least the key performance deltas and dataset names for immediate context.
- [Method] Notation: define 'attention entropy' and 'portfolio' explicitly on first use, and ensure consistent terminology between Contextual Anchoring and the subsequent crop construction step.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the attribution of our results and strengthen the methodological details. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
Referee: Experiments section: the central claim that 'the hierarchical structure of the entropy-scaled portfolio is key' depends on the unstructured multi-crop baseline. Specify whether this baseline also centers crops on the Contextual Anchoring output or samples randomly across the full image. If the latter, the comparison confounds anchoring benefits with entropy-driven scaling, undermining attribution to the portfolio structure.
Authors: We agree this distinction is important for proper attribution. In the current experiments, the unstructured multi-crop baseline samples crops randomly across the full image (without Contextual Anchoring). To isolate the contribution of the hierarchical entropy-scaled structure, we have run additional controls that apply the same anchoring centers but use uniform or random sizes instead of entropy scaling. These results show that the entropy-driven hierarchical sizing still yields measurable gains over anchored but unstructured crops, supporting our claim. We will add these controls, clarify the baseline construction, and update the experiments section and appendix in the revision.
Referee: Method section (entropy scaling step): the assumption that attention entropy reliably determines crop sizes and centers without introducing new disconnects lacks supporting analysis. Provide the exact entropy computation formula, the scaling rule for crop sizes, and any ablation on alternative entropy thresholds or center refinement heuristics.
Authors: We acknowledge that the entropy computation and scaling rule were described at a high level rather than with explicit formulas. The attention entropy is computed as the Shannon entropy H = -∑(p_i * log p_i) over the normalized attention weights p_i of the image patches from the final layer. Crop sizes are then scaled as size = min_size + (max_size - min_size) * (H / H_max), with centers refined by shifting toward the attention-weighted centroid within a small radius. We will insert the precise formulas, pseudocode, and ablation results (varying entropy thresholds and refinement heuristics) into the Method section and supplementary material.
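The two formulas in this response can be written out directly. As a hedge, the normalization H_max = log N (the entropy of a uniform distribution over N patches) and the plain-list weight layout are assumptions filled in for illustration; the rebuttal does not specify them:

```python
import math

def attention_entropy(weights):
    """Shannon entropy H = -sum(p_i * log p_i) over normalized weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_scaled_size(weights, min_size, max_size):
    """size = min_size + (max_size - min_size) * (H / H_max).

    H_max = log(N) is assumed here: the entropy of uniform attention
    over N patches, so the ratio H / H_max lies in [0, 1].
    """
    h = attention_entropy(weights)
    h_max = math.log(len(weights))
    return min_size + (max_size - min_size) * (h / h_max)
```

Under this reading, diffuse (high-entropy) attention yields a large crop that keeps broad context, while sharply peaked attention yields a tight crop around the focal detail, which matches the "funnel" intuition.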
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper presents a training-free pipeline that directly consumes attention maps already generated by the base MLLM, computes entropy to set crop sizes and centers, and evaluates against explicit baselines. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The central claim (hierarchical entropy scaling outperforms unstructured crops) is supported by experimental comparison rather than by construction from the method's own inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: attention entropy from the base MLLM can be used to determine crop sizes that restore structural diversity.