pith. machine review for the scientific record.

arxiv: 2604.08456 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.CL

Recognition: unknown

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:11 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords entropy-gradient grounding · vision-language models · training-free grounding · evidence retrieval · uncertainty supervision · test-time grounding · relevance map · multi-evidence queries
0 comments

The pith

Vision-language models can ground answers in fine visual details by backpropagating next-token entropy, without training or external detectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that uncertainty in a VLM's next-token predictions already encodes where the model needs to look in an image to resolve a query. By computing the entropy of the output distribution and backpropagating it to the visual token embeddings, the method produces a relevance map that highlights supporting regions. This map is used to extract and rank multiple coherent evidence areas for queries that span several parts of an image or document. An iterative zoom procedure refines the map while a spatial-entropy rule decides when to stop. The approach is tested on seven benchmarks spanning four model families and shows gains especially when answers depend on fine details or high-resolution inputs.

Core claim

Grounding is reframed as test-time evidence retrieval in which the entropy of the model's next-token distribution serves as an intrinsic supervision signal. Backpropagating this entropy through the visual embeddings yields a relevance map that identifies image regions needed to reduce predictive uncertainty, allowing extraction of ranked multi-region evidence without attention heuristics or auxiliary models. An iterative zoom-and-reground loop with a spatial-entropy stopping criterion prevents over-refinement while preserving necessary detail.
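As a concrete reading of that mechanism, the sketch below shows one way an entropy-gradient relevance map could be computed for a generic VLM. It is an editorial illustration under stated assumptions, not the paper's implementation: the forward_fn stand-in for the query-conditioned forward pass, the per-token gradient-norm pooling, and the toy linear readout in the demo are all invented for the example.

    import torch
    import torch.nn.functional as F

    def entropy_gradient_map(visual_embeds, forward_fn, grid_hw):
        """Backpropagate next-token entropy to visual token embeddings.

        visual_embeds: (num_tokens, dim) visual token embeddings.
        forward_fn:    callable mapping embeddings -> next-token logits (vocab,);
                       stands in for the VLM forward pass conditioned on the query.
        grid_hw:       (H, W) spatial layout of the visual tokens.
        """
        embeds = visual_embeds.clone().requires_grad_(True)
        logits = forward_fn(embeds)                    # (vocab,)
        log_p = F.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum()         # H(p(next token | query, image))
        entropy.backward()
        relevance = embeds.grad.norm(dim=-1)           # gradient magnitude per visual token
        return (relevance / (relevance.max() + 1e-8)).view(*grid_hw)

    # Demo with a toy stand-in for the VLM: mean-pool tokens into a linear readout.
    torch.manual_seed(0)
    H, W, dim, vocab = 24, 24, 64, 32000
    readout = torch.nn.Linear(dim, vocab)
    rel = entropy_gradient_map(torch.randn(H * W, dim), lambda e: readout(e.mean(0)), (H, W))
    print(rel.shape)  # torch.Size([24, 24])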

What carries the argument

The entropy-gradient relevance map produced by backpropagating next-token entropy to visual token embeddings.

If this is right

  • Consistent accuracy gains appear on detail-critical and high-resolution tasks across seven benchmarks and four VLM architectures.
  • Multiple coherent regions can be extracted and ranked to support compositional or multi-evidence queries (see the extraction sketch after this list).
  • The iterative zoom procedure with spatial-entropy stopping improves localization without requiring extra training.
  • Evidence maps are produced directly from model internals, increasing interpretability of the grounding process.
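For the multi-region claim above, one plausible way to turn a relevance map into ranked evidence regions is to threshold it and rank connected components by mean relevance, as sketched below. The threshold fraction, minimum area, and 4-connectivity are illustrative choices for the example, not parameters reported by the paper.

    import numpy as np
    from scipy import ndimage

    def extract_ranked_regions(rel_map, threshold=0.5, min_area=4):
        """Threshold a relevance map, label connected components, and rank them
        by mean relevance. Returns [(score, (row0, row1, col0, col1)), ...] in
        token-grid coordinates, highest-scoring region first."""
        mask = rel_map >= threshold * rel_map.max()
        labels, num = ndimage.label(mask)              # 4-connected components
        regions = []
        for k in range(1, num + 1):
            rows, cols = np.nonzero(labels == k)
            if rows.size < min_area:                   # drop tiny speckles
                continue
            score = float(rel_map[rows, cols].mean())
            box = (int(rows.min()), int(rows.max()) + 1,
                   int(cols.min()), int(cols.max()) + 1)
            regions.append((score, box))
        return sorted(regions, key=lambda r: r[0], reverse=True)

    # Example: two separated blobs in a 24x24 map yield two ranked regions.
    demo = np.zeros((24, 24))
    demo[2:6, 3:8] = 0.9
    demo[15:20, 16:22] = 0.7
    print(extract_ranked_regions(demo))

Each box can then be mapped back to pixel coordinates and supplied as an additional crop, which is how the figures below describe the regions being used.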

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same entropy signal might be usable to decide when a VLM should request higher-resolution crops or additional views during inference.
  • Extending the backpropagation to other uncertainty measures, such as token-level variance, could yield complementary relevance maps.
  • The method offers a route to focus computation on relevant image patches, potentially lowering inference cost for high-resolution inputs.

Load-bearing premise

Backpropagating next-token entropy to visual embeddings produces a relevance map that reliably marks the image regions needed to answer the query.

What would settle it

On a benchmark of detail-critical queries, the entropy-gradient maps show no better alignment with human-annotated evidence regions than random or attention-based baselines, and answer accuracy does not improve.
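A minimal version of that test, assuming per-query human evidence masks are available, is a pointing-game and IoU comparison between each relevance map and its annotated region; the specific metrics and the 0.5 threshold below are illustrative choices, not an evaluation protocol taken from the paper.

    import numpy as np

    def pointing_game_hit(rel_map, evidence_mask):
        """1 if the relevance map's argmax falls inside the annotated evidence
        region, else 0 (the classic pointing-game criterion)."""
        r, c = np.unravel_index(np.argmax(rel_map), rel_map.shape)
        return int(evidence_mask[r, c] > 0)

    def mask_iou(rel_map, evidence_mask, threshold=0.5):
        """IoU between the thresholded relevance map and the evidence mask."""
        pred = rel_map >= threshold * rel_map.max()
        gt = evidence_mask > 0
        union = np.logical_or(pred, gt).sum()
        return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

Averaged over a detail-critical benchmark and compared against random and attention-based maps, these scores are the kind of evidence that would confirm or undercut the premise.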

Figures

Figures reproduced from arXiv: 2604.08456 by Jaewoo Jung, Marcel Gröpl, Marc Pollefeys, Seungryong Kim, Sunghwan Hong.

Figure 1
Figure 1: Teaser. As shown in (a) and (b), existing VLMs struggle to answer questions when visual evidence is fine-grained or exists in spatially disjoint regions. We propose a training-free method where we apply a query-based visual grounding method to discover relevant regions and provide these regions as additional image crops, improving performance in both challenging scenarios. view at source ↗
Figure 2
Figure 2: Overview of proposed method. (a) Given an image and prompt, we obtain an initial region-of-interest by backpropagating the entropy of the next-token distribution to visual embeddings, producing an entropy-gradient relevance map. (b) We iteratively re-ground and re-crop the most informative regions, stopping when the spatial entropy criterion indicates further refinement no longer improves spatial concentration. view at source ↗
Figure 3
Figure 3: Relative attention vs. entropy-gradient. view at source ↗
Figure 4
Figure 4: Visual grounding via entropy-gradient map. view at source ↗
Figure 5
Figure 5: Qualitative examples of iterative and multi-region grounding. view at source ↗
Figure 6
Figure 6: Qualitative comparison of entropy-based gradient maps across different VLMs. We observe that weaker models (e.g., LLaVA 1.5) more frequently allocate gradients to spatially misaligned regions, whereas stronger models produce better-localized attributions. Among the compared models, LLaVA 1.6 generally yields the cleanest and most coherent gradient maps. view at source ↗
Figure 7
Figure 7: Layer-wise maximum average gradient magnitude on TextVQA, illustrating how gradient strength evolves with model depth. view at source ↗
Figure 8
Figure 8: Example of gradient computation at different token positions. view at source ↗
Figure 9
Figure 9: Qualitative comparison of gradient-based grounding methods on … view at source ↗
Figure 10
Figure 10: Additional Examples. Qualitative examples on LLaVA 1.5. The most important final crop is highlighted in red. view at source ↗
Figure 11
Figure 11: Additional Examples. Qualitative examples on LLaVA 1.6. The most important final crop is highlighted in red. view at source ↗
Figure 12
Figure 12: Additional Examples. Qualitative examples on native resolution encodings (Qwen 2.5, InternVL 3.5). The most important final crop is highlighted in red. view at source ↗
Figure 13
Figure 13: Failure cases. Qualitative examples of failing to predict the correct answer, illustrating a limitation of our method. Even when confronted with a very detailed crop, the model still fails to answer correctly. view at source ↗
read the original abstract

Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Entropy-Gradient Grounding, a training-free, model-intrinsic method for evidence retrieval in vision-language models. It computes the entropy of the next-token distribution and backpropagates this to visual token embeddings to produce a relevance map, extracts and ranks coherent regions for multi-evidence queries, and applies an iterative zoom-and-reground procedure with a spatial-entropy stopping rule. Experiments report consistent gains over baselines on seven benchmarks across four VLM architectures, with the largest improvements in detail-critical and high-resolution settings.

Significance. If the central claims hold, the approach provides a practical, training-free alternative to attention heuristics or auxiliary detectors for improving VLM grounding on compositional and detail-dependent tasks. The model-intrinsic use of predictive entropy as supervision and the iterative procedure are strengths that could enhance interpretability without retraining.

major comments (3)
  1. [Experiments] Experiments (no section number given; results paragraph): The abstract and results claim consistent improvements on seven benchmarks, but no data splits, error bars, full ablation tables, or statistical significance tests are described. This makes it impossible to verify whether the reported gains are robust or attributable to the entropy-gradient procedure rather than implementation details.
  2. [Method] Method (gradient definition): The relevance map is obtained by backpropagating next-token entropy H(p(·|query, image tokens)) to visual token embeddings after the vision encoder. No controlled ablation (e.g., zeroing language-side gradients or comparing to attention rollout) is provided to demonstrate that the map isolates query-specific visual evidence rather than global uncertainty or cross-modal artifacts.
  3. [Iterative zoom-and-reground] Iterative procedure (spatial-entropy stopping rule): The stopping criterion based on spatial entropy is introduced to avoid over-refinement, but no analysis or ablation shows it reliably halts when key local evidence remains unexamined versus when global entropy plateaus due to non-visual factors.
minor comments (2)
  1. [Method] Notation for the entropy-gradient map and region extraction steps could be formalized with equations for reproducibility.
  2. [Abstract] The abstract mentions 'four VLM architectures' but does not list them explicitly; this should be stated in the experiments section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional details and analyses can strengthen the manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: The abstract and results claim consistent improvements on seven benchmarks, but no data splits, error bars, full ablation tables, or statistical significance tests are described. This makes it impossible to verify whether the reported gains are robust or attributable to the entropy-gradient procedure rather than implementation details.

    Authors: We agree that the current manuscript lacks sufficient experimental protocol details. In the revision, we will explicitly describe the data splits for each benchmark, report error bars as standard deviations over multiple random seeds, expand all ablations into complete tables, and include statistical significance tests (paired t-tests with p-values) comparing our method against baselines. These changes will allow verification that gains are attributable to the proposed procedure. revision: yes

  2. Referee: The relevance map is obtained by backpropagating next-token entropy H(p(·|query, image tokens)) to visual token embeddings after the vision encoder. No controlled ablation (e.g., zeroing language-side gradients or comparing to attention rollout) is provided to demonstrate that the map isolates query-specific visual evidence rather than global uncertainty or cross-modal artifacts.

    Authors: While cross-architecture consistency and task-specific gains provide indirect support for query-specific isolation, we acknowledge the value of direct controls. The revision will add ablations that zero language-side gradients during backpropagation and compare entropy-gradient maps against attention rollout and random baselines. These will quantify how much the maps depend on query-visual interactions versus global factors. revision: yes

  3. Referee: The stopping criterion based on spatial entropy is introduced to avoid over-refinement, but no analysis or ablation shows it reliably halts when key local evidence remains unexamined versus when global entropy plateaus due to non-visual factors.

    Authors: The spatial-entropy rule is motivated by monitoring local uncertainty reduction, but we agree that targeted validation is needed. The revision will include an ablation comparing performance with and without the stopping rule, plus qualitative case studies showing halt points on queries with remaining unexamined evidence. This will demonstrate that the criterion stops appropriately rather than due to non-visual plateaus. revision: yes
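For context on the criterion being debated: one plausible reading of a spatial-entropy stopping rule is to treat the relevance map as a spatial probability distribution, track its Shannon entropy across zoom iterations, and halt once the entropy stops decreasing appreciably. The relative-tolerance form below is an editorial guess at such a rule, not the paper's exact formulation.

    import numpy as np

    def spatial_entropy(rel_map):
        """Shannon entropy of a relevance map normalized into a spatial
        probability distribution; lower values mean more concentrated mass."""
        p = np.clip(rel_map, 0, None).ravel()
        p = p / (p.sum() + 1e-12)
        nz = p[p > 0]
        return float(-(nz * np.log(nz)).sum())

    def should_stop(entropy_history, rel_tol=0.05):
        """Stop the zoom-and-reground loop once spatial entropy no longer drops
        by more than rel_tol relative to the previous iteration."""
        if len(entropy_history) < 2:
            return False
        prev, cur = entropy_history[-2], entropy_history[-1]
        return (prev - cur) < rel_tol * abs(prev)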

Circularity Check

0 steps flagged

No circularity: the method is a direct computational procedure without self-referential definitions or fitted predictions.

full rationale

The paper defines the entropy-gradient relevance map explicitly as the gradient of next-token entropy H(p(·|query, image tokens)) with respect to visual token embeddings after the vision encoder. This is a first-principles computation on the pretrained VLM forward pass and does not reduce to any fitted parameter, self-definition, or prior result by the same authors. No equations or sections invoke self-citations as load-bearing justification, no ansatz is smuggled, and no known empirical pattern is merely renamed. The iterative zoom-and-reground procedure with spatial-entropy stopping rule is likewise a stated algorithmic rule, not a tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that next-token entropy serves as a useful supervision signal for visual grounding; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Entropy of the next-token distribution reflects visual ambiguity that can be resolved by attending to specific image regions.
    Invoked when using entropy backpropagation to generate the relevance map for evidence retrieval.

pith-pipeline@v0.9.0 · 5485 in / 1201 out tokens · 63239 ms · 2026-05-10T17:11:12.628777+00:00 · methodology

discussion (0)

