Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
Pith reviewed 2026-05-10 17:11 UTC · model grok-4.3
The pith
Vision-language models ground their answers in fine visual details by backpropagating next-token entropy, requiring no training or external detectors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Grounding is reframed as test-time evidence retrieval in which the entropy of the model's next-token distribution serves as an intrinsic supervision signal. Backpropagating this entropy through the visual embeddings yields a relevance map that identifies image regions needed to reduce predictive uncertainty, allowing extraction of ranked multi-region evidence without attention heuristics or auxiliary models. An iterative zoom-and-reground loop with a spatial-entropy stopping criterion prevents over-refinement while preserving necessary detail.
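The zoom-and-reground loop admits a compact sketch. Here the half-size crop schedule, the threshold form `stop_tau * log(n_patches)`, and the `relevance_fn` stub are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

def spatial_entropy(rel, eps=1e-12):
    """Shannon entropy of a relevance map normalized to a distribution.

    Low spatial entropy means mass is concentrated on a few patches,
    i.e. the map has committed to a location.
    """
    r = rel.ravel()
    r = r / (r.sum() + eps)
    return float(-(r * np.log(r + eps)).sum())

def zoom_and_reground(image, relevance_fn, max_steps=4, stop_tau=0.5):
    """Iteratively crop toward the relevance peak.

    relevance_fn(image) -> 2D relevance map over patches (a stand-in
    for the entropy-gradient map).  Stops once spatial entropy falls
    below stop_tau * log(n_patches) -- one plausible form of the
    spatial-entropy stopping rule, not necessarily the paper's.
    """
    crops = [image]
    for _ in range(max_steps):
        rel = relevance_fn(crops[-1])
        if spatial_entropy(rel) < stop_tau * np.log(rel.size):
            break  # map is peaked enough; more zooming risks over-refinement
        gy, gx = np.unravel_index(np.argmax(rel), rel.shape)
        h, w = crops[-1].shape[:2]
        ph, pw = h // rel.shape[0], w // rel.shape[1]
        cy, cx = gy * ph + ph // 2, gx * pw + pw // 2  # peak patch center
        y0 = int(np.clip(cy - h // 4, 0, h - h // 2))
        x0 = int(np.clip(cx - w // 4, 0, w - w // 2))
        crops.append(crops[-1][y0:y0 + h // 2, x0:x0 + w // 2])
    return crops
```

Plugging an entropy-gradient map in for `relevance_fn` recovers the described procedure: re-ground on each crop, zoom while the map is diffuse, and halt once it concentrates.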
What carries the argument
The entropy-gradient relevance map produced by backpropagating next-token entropy to visual token embeddings.
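For a softmax over logits z, the entropy H = -Σ_j p_j log p_j has the closed-form gradient ∂H/∂z_j = -p_j(log p_j + H), which the chain rule carries back to the visual token embeddings. A minimal NumPy sketch with a toy linear head standing in for the VLM (the head `W` and the flattened-token wiring are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

def entropy_grad_relevance(W, v_tokens):
    """Per-token relevance from the gradient of next-token entropy.

    Toy stand-in for a VLM: next-token logits z = W @ flatten(v_tokens).
    Uses the closed form dH/dz_j = -p_j (log p_j + H), then the chain
    rule back to the embeddings; relevance is the per-token grad norm.
    W: (vocab, n_tokens * dim); v_tokens: (n_tokens, dim).
    """
    n, d = v_tokens.shape
    p = softmax(W @ v_tokens.ravel())
    H = entropy(p)
    dH_dz = -p * (np.log(p) + H)         # closed-form entropy gradient
    dH_dv = (W.T @ dH_dz).reshape(n, d)  # backprop through the linear head
    relevance = np.linalg.norm(dH_dv, axis=1)
    return relevance, dH_dv, H
```

In a real VLM the backward pass runs through the full language model rather than one linear map, but the per-token gradient-norm readout is the same idea.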
If this is right
- Consistent accuracy gains appear on detail-critical and high-resolution tasks across seven benchmarks and four VLM architectures.
- Multiple coherent regions can be extracted and ranked to support compositional or multi-evidence queries.
- The iterative zoom procedure with spatial-entropy stopping improves localization without requiring extra training.
- Evidence maps are produced directly from model internals, increasing interpretability of the grounding process.
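The multi-region extraction in the second point can be approximated by thresholding the relevance map and ranking connected components; the threshold rule and 4-connectivity below are assumptions, since the exact extraction step is not specified in this summary:

```python
import numpy as np
from collections import deque

def extract_regions(rel, thresh_ratio=0.5):
    """Threshold a 2D relevance map and return connected regions
    ranked by total relevance (highest first).

    Minimal stand-in for multi-region evidence extraction:
    4-connected components above thresh_ratio * max(rel).
    """
    mask = rel >= thresh_ratio * rel.max()
    seen = np.zeros_like(mask, dtype=bool)
    nrows, ncols = rel.shape
    regions = []
    for sy in range(nrows):
        for sx in range(ncols):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            comp, q = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while q:  # BFS flood fill over the thresholded mask
                y, x = q.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < nrows and 0 <= nx < ncols \
                            and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        q.append((ny, nx))
            score = float(sum(rel[y, x] for y, x in comp))
            regions.append((score, comp))
    regions.sort(key=lambda r: -r[0])
    return regions
```

The ranked regions can then each be cropped and fed back to the model as evidence for multi-evidence queries.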
Where Pith is reading between the lines
- The same entropy signal might be usable to decide when a VLM should request higher-resolution crops or additional views during inference.
- Extending the backpropagation to other uncertainty measures, such as token-level variance, could yield complementary relevance maps.
- The method offers a route to focus computation on relevant image patches, potentially lowering inference cost for high-resolution inputs.
Load-bearing premise
Backpropagating next-token entropy to visual embeddings produces a relevance map that reliably marks the image regions needed to answer the query.
What would settle it
The claim would be refuted if, on a benchmark of detail-critical queries, the entropy-gradient maps aligned no better with human-annotated evidence regions than random or attention-based baselines and answer accuracy did not improve.
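One common way to operationalize alignment with human-annotated evidence regions is the pointing game: a map scores a hit when its peak falls inside the annotated box. A minimal sketch (this metric choice is an assumption, not the paper's stated protocol):

```python
import numpy as np

def pointing_game(rel_maps, gt_boxes):
    """Fraction of maps whose peak lands inside the annotated box.

    rel_maps: list of 2D arrays; gt_boxes: list of (y0, x0, y1, x1),
    half-open.  A simple alignment score for comparing entropy-gradient
    maps against random or attention-based baselines.
    """
    hits = 0
    for rel, (y0, x0, y1, x1) in zip(rel_maps, gt_boxes):
        py, px = np.unravel_index(np.argmax(rel), rel.shape)
        hits += int(y0 <= py < y1 and x0 <= px < x1)
    return hits / len(rel_maps)
```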
Original abstract
Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Entropy-Gradient Grounding, a training-free, model-intrinsic method for evidence retrieval in vision-language models. It computes the entropy of the next-token distribution and backpropagates this to visual token embeddings to produce a relevance map, extracts and ranks coherent regions for multi-evidence queries, and applies an iterative zoom-and-reground procedure with a spatial-entropy stopping rule. Experiments report consistent gains over baselines on seven benchmarks across four VLM architectures, with the largest improvements in detail-critical and high-resolution settings.
Significance. If the central claims hold, the approach provides a practical, training-free alternative to attention heuristics or auxiliary detectors for improving VLM grounding on compositional and detail-dependent tasks. The model-intrinsic use of predictive entropy as supervision and the iterative procedure are strengths that could enhance interpretability without retraining.
major comments (3)
- [Experiments] Experiments (no section number given; results paragraph): The abstract and results claim consistent improvements on seven benchmarks, but no data splits, error bars, full ablation tables, or statistical significance tests are described. This makes it impossible to verify whether the reported gains are robust or attributable to the entropy-gradient procedure rather than implementation details.
- [Method] Method (gradient definition): The relevance map is obtained by backpropagating next-token entropy H(p(·|query, image tokens)) to visual token embeddings after the vision encoder. No controlled ablation (e.g., zeroing language-side gradients or comparing to attention rollout) is provided to demonstrate that the map isolates query-specific visual evidence rather than global uncertainty or cross-modal artifacts.
- [Iterative zoom-and-reground] Iterative procedure (spatial-entropy stopping rule): The stopping criterion based on spatial entropy is introduced to avoid over-refinement, but no analysis or ablation shows that it distinguishes cases where key local evidence remains unexamined (and refinement should continue) from cases where global entropy has plateaued for non-visual reasons (and refinement should stop).
minor comments (2)
- [Method] Notation for the entropy-gradient map and region extraction steps could be formalized with equations for reproducibility.
- [Abstract] The abstract mentions 'four VLM architectures' but does not list them explicitly; this should be stated in the experiments section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight areas where additional details and analyses can strengthen the manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and rigor.
Point-by-point responses
Referee: The abstract and results claim consistent improvements on seven benchmarks, but no data splits, error bars, full ablation tables, or statistical significance tests are described. This makes it impossible to verify whether the reported gains are robust or attributable to the entropy-gradient procedure rather than implementation details.
Authors: We agree that the current manuscript lacks sufficient experimental protocol details. In the revision, we will explicitly describe the data splits for each benchmark, report error bars as standard deviations over multiple random seeds, expand all ablations into complete tables, and include statistical significance tests (paired t-tests with p-values) comparing our method against baselines. These changes will allow verification that gains are attributable to the proposed procedure. revision: yes
Referee: The relevance map is obtained by backpropagating next-token entropy H(p(·|query, image tokens)) to visual token embeddings after the vision encoder. No controlled ablation (e.g., zeroing language-side gradients or comparing to attention rollout) is provided to demonstrate that the map isolates query-specific visual evidence rather than global uncertainty or cross-modal artifacts.
Authors: While cross-architecture consistency and task-specific gains provide indirect support for query-specific isolation, we acknowledge the value of direct controls. The revision will add ablations that zero language-side gradients during backpropagation and compare entropy-gradient maps against attention rollout and random baselines. These will quantify how much the maps depend on query-visual interactions versus global factors. revision: yes
Referee: The stopping criterion based on spatial entropy is introduced to avoid over-refinement, but no analysis or ablation shows that it distinguishes cases where key local evidence remains unexamined (and refinement should continue) from cases where global entropy has plateaued for non-visual reasons (and refinement should stop).
Authors: The spatial-entropy rule is motivated by monitoring local uncertainty reduction, but we agree that targeted validation is needed. The revision will include an ablation comparing performance with and without the stopping rule, plus qualitative case studies showing halt points on queries with remaining unexamined evidence. This will demonstrate that the criterion stops appropriately rather than due to non-visual plateaus. revision: yes
Circularity Check
No circularity: the method is a direct computational procedure with no self-referential definitions or fitted predictions.
Full rationale
The paper defines the entropy-gradient relevance map explicitly as the gradient of next-token entropy H(p(·|query, image tokens)) with respect to visual token embeddings after the vision encoder. This is a first-principles computation on the pretrained VLM forward pass and does not reduce to any fitted parameter, self-definition, or prior result by the same authors. No equations or sections invoke self-citations as load-bearing justification, no ansatz is smuggled, and no known empirical pattern is merely renamed. The iterative zoom-and-reground procedure with spatial-entropy stopping rule is likewise a stated algorithmic rule, not a tautology. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Entropy of the next-token distribution reflects visual ambiguity that can be resolved by attending to specific image regions.