pith. machine review for the scientific record.

arxiv: 2404.12390 · v4 · submitted 2024-04-18 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 Lean theorem links

BLINK: Multimodal Large Language Models Can See but Not Perceive

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords multimodal LLMs · visual perception · BLINK benchmark · computer vision tasks · depth estimation · visual correspondence · GPT-4V · Gemini

The pith

Multimodal LLMs like GPT-4V reach only 51% accuracy on visual perception tasks that humans solve at 96%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BLINK introduces a benchmark of 3,807 multiple-choice questions drawn from 14 classic computer vision tasks, such as relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning. These tasks can be solved by humans within a blink yet resist mediation through natural language, so they test whether multimodal LLMs have developed genuine visual perception. The paper reports that even the strongest models, GPT-4V and Gemini, score 51.26% and 45.72%, only modestly above random guessing, while humans average 95.70%. Specialist computer vision models solve the same problems far more reliably. The results indicate that core perceptual abilities have not yet emerged in recent multimodal LLMs.

Core claim

Multimodal large language models can process images but lack core visual perception abilities; on the BLINK benchmark they achieve at most 51.26% accuracy, only 13.17 percentage points above random guessing, whereas humans reach 95.70%.

What carries the argument

The BLINK benchmark, which reformats classic computer vision tasks into image-paired multiple-choice questions that cannot be solved reliably through language patterns alone.
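To make the reformatting concrete, here is a minimal sketch of what a single BLINK-style item could look like as a data record. The field names, option layout, and example values are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of one BLINK-style item: a classic computer vision task
# instance recast as an image-paired multiple-choice question. Field names and
# values are assumptions for illustration, not the benchmark's actual schema.
@dataclass
class BlinkItem:
    task: str               # e.g. "relative_depth" or "visual_correspondence"
    image_paths: List[str]  # one or more images, possibly with visual prompts drawn on them
    question: str           # natural-language question about the image(s)
    choices: List[str]      # multiple-choice options, typically two to four
    answer: str             # gold choice label, e.g. "(A)"

example = BlinkItem(
    task="relative_depth",
    image_paths=["scene_0421.jpg"],
    question="Which marked point is closer to the camera?",
    choices=["(A) Point A", "(B) Point B"],
    answer="(A)",
)
```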

If this is right

  • Multimodal LLMs will need mechanisms beyond current scaling to acquire reliable visual perception.
  • Integration with specialist computer vision models offers a direct route to higher accuracy on these tasks.
  • Future training objectives should prioritize tasks that resist solution by text-only statistical patterns.
  • Benchmarks focused on perception will be required to measure progress toward human-level visual understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The gap may reflect training priorities that favor language alignment over precise visual feature extraction.
  • Similar limitations are likely to appear in related areas such as video understanding and spatial reasoning.
  • Explicit perceptual modules drawn from traditional computer vision could be combined with LLMs to close the gap more quickly than scaling alone.

Load-bearing premise

The selected tasks genuinely require visual perception that cannot be solved through language patterns or statistical shortcuts in the training data.

What would settle it

A multimodal LLM that reaches 90% or higher accuracy on the full BLINK set after targeted training or architectural changes, without relying on external CV modules, would falsify the claim.

read the original abstract

We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BLINK, a benchmark of 3,807 multiple-choice questions reformatted from 14 classic computer vision tasks (e.g., relative depth estimation, visual correspondence, forensics detection) that are claimed to require core visual perception abilities resistant to natural-language mediation. Humans achieve 95.70% accuracy on average, while leading MLLMs such as GPT-4V and Gemini reach only 51.26% and 45.72% (13.17% and 7.63% above random), and specialist CV models perform substantially better.

Significance. If the tasks genuinely isolate visual perception, the work would be significant by documenting a large, reproducible gap between human and current MLLM performance on perception-heavy CV problems and by supplying a concrete benchmark that could guide integration of specialist CV techniques into multimodal models.

major comments (2)
  1. [Benchmark Design] The headline claim that perception abilities 'have not emerged' rests on the assertion that the 14 tasks 'resist mediation through natural language,' yet the manuscript reports no text-only baselines, blank-image ablations, or option-bias controls. Without these, low MLLM accuracies could arise from prompt sensitivity or memorized correlations rather than absent visual perception (see skeptic note and abstract).
  2. [Results] Table or results section: the exact random-guessing baseline is not derived or tabulated per task (e.g., 4-option vs. 2-option questions), making the stated margins of 13.17% and 7.63% above chance impossible to verify independently.
minor comments (2)
  1. [Methods] Add a short paragraph in the methods describing image sourcing, exclusion criteria, and inter-annotator agreement for the 3,807 questions.
  2. [Figures] Figure captions should explicitly state the number of images per question and the prompting format used for each task.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing the BLINK benchmark. We address each of the major comments point by point below. We agree with the need for additional controls and baselines to strengthen our claims and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Benchmark Design] The headline claim that perception abilities 'have not emerged' rests on the assertion that the 14 tasks 'resist mediation through natural language,' yet the manuscript reports no text-only baselines, blank-image ablations, or option-bias controls. Without these, low MLLM accuracies could arise from prompt sensitivity or memorized correlations rather than absent visual perception (see skeptic note and abstract).

    Authors: We recognize the importance of these controls to substantiate that the tasks primarily require visual perception rather than language-based reasoning. In the revised version, we will include text-only baselines by evaluating the MLLMs on the textual questions without any images. We will also perform blank-image ablations and analyze option biases by shuffling choices. These additions will allow us to better isolate the visual component and address potential confounds. revision: yes

  2. Referee: [Results] Table or results section: the exact random-guessing baseline is not derived or tabulated per task (e.g., 4-option vs. 2-option questions), making the stated margins of 13.17% and 7.63% above chance impossible to verify independently.

    Authors: We apologize for the omission and agree that providing per-task random baselines is essential for independent verification. We will update the results section and tables to explicitly list the number of choices for each task and compute the corresponding random accuracy (25% for four options, 50% for two, etc.). The reported margins above random will be recalculated based on the weighted average across tasks. revision: yes
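For readers who want to sanity-check the margins, a minimal sketch of the question-weighted random baseline described above; the task names, question counts, and option counts are placeholders rather than the actual BLINK statistics.

```python
# Sketch of the per-task random-guessing baseline requested by the referee.
# Task names, question counts, and option counts below are placeholders;
# the real values would come from the BLINK per-task tables.
tasks = {
    # task: (number of questions, number of answer options)
    "relative_depth": (124, 2),
    "multi_view_reasoning": (133, 2),
    "visual_correspondence": (343, 4),
    "forensics_detection": (500, 4),
}

def weighted_random_baseline(tasks: dict) -> float:
    """Question-weighted average of per-task chance accuracy (1 / number of options)."""
    total = sum(n for n, _ in tasks.values())
    return sum(n / k for n, k in tasks.values()) / total

print(f"random baseline: {weighted_random_baseline(tasks):.2%}")

# The abstract's own numbers imply a benchmark-wide baseline near 38%
# (51.26% - 13.17% and 45.72% - 7.63% both give 38.09%), which a full
# 14-task table of this form should reproduce.
```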

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with independent controls

full rationale

The paper introduces BLINK as a new benchmark by reformatting 14 existing CV tasks into multiple-choice questions and reports model accuracies against human performance and random guessing baselines. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text. The central claim rests on direct empirical comparison on held-out questions, with no reduction of results to inputs by construction. Specialist CV model comparisons further support external grounding rather than internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations. It rests on the domain assumption that the chosen tasks isolate visual perception independent of language.

axioms (1)
  • domain assumption: The selected computer vision tasks measure core perceptual abilities that cannot be solved through language mediation alone
    Invoked in the abstract when stating that the tasks resist mediation through natural language.

pith-pipeline@v0.9.0 · 5545 in / 1159 out tokens · 19462 ms · 2026-05-15T20:14:26.616449+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

    cs.CV 2026-05 unverdicted novelty 7.0

    MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.

  3. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  4. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  5. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  6. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  7. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  8. Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

  9. RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

    cs.LG 2026-04 unverdicted novelty 6.0

    RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...

  10. Multimodal Language Models Cannot Spot Spatial Inconsistencies

    cs.CV 2026-04 unverdicted novelty 6.0

    Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. Depth Anything V2

    cs.CV 2024-06 unverdicted novelty 6.0

    Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.

  16. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  17. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  18. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  19. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 18 Pith papers · 16 internal anchors
