pith. machine review for the scientific record.

arxiv: 2404.12390 · v4 · submitted 2024-04-18 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 Lean theorem links

BLINK: Multimodal Large Language Models Can See but Not Perceive

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords multimodal LLMs · visual perception · BLINK benchmark · computer vision tasks · depth estimation · visual correspondence · GPT-4V · Gemini

The pith

Multimodal LLMs like GPT-4V reach only 51% accuracy on visual perception tasks that humans solve at 96%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BLINK introduces a benchmark of 3,807 multiple-choice questions drawn from 14 classic computer vision tasks, such as relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning. These tasks can be solved by humans within a blink yet resist mediation through natural language, so they test whether multimodal LLMs have developed genuine visual perception. The paper reports that even the strongest models, GPT-4V and Gemini, score 51.26% and 45.72%, only modestly above random guessing, while humans average 95.70%. Specialist computer vision models solve the same problems far more reliably. The results indicate that core perceptual abilities have not yet emerged in recent multimodal LLMs.

Core claim

Multimodal large language models can process images but lack core visual perception abilities; on the BLINK benchmark they achieve at most 51.26% accuracy, only 13.17 percentage points above random guessing, whereas humans reach 95.70%.

What carries the argument

The BLINK benchmark, which reformats classic computer vision tasks into image-paired multiple-choice questions that cannot be solved reliably through language patterns alone.
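To make the reformatting concrete, here is a minimal sketch of what a single BLINK-style item could look like as a data record. The field names, option layout, and example values are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of one BLINK-style item: a classic computer vision task
# instance recast as an image-paired multiple-choice question. Field names and
# values are assumptions for illustration, not the benchmark's actual schema.
@dataclass
class BlinkItem:
    task: str               # e.g. "relative_depth" or "visual_correspondence"
    image_paths: List[str]  # one or more images, possibly with visual prompts drawn on them
    question: str           # natural-language question about the image(s)
    choices: List[str]      # multiple-choice options, typically two to four
    answer: str             # gold choice label, e.g. "(A)"

example = BlinkItem(
    task="relative_depth",
    image_paths=["scene_0421.jpg"],
    question="Which marked point is closer to the camera?",
    choices=["(A) Point A", "(B) Point B"],
    answer="(A)",
)
```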

If this is right

  • Multimodal LLMs will need mechanisms beyond current scaling to acquire reliable visual perception.
  • Integration with specialist computer vision models offers a direct route to higher accuracy on these tasks.
  • Future training objectives should prioritize tasks that resist solution by text-only statistical patterns.
  • Benchmarks focused on perception will be required to measure progress toward human-level visual understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The gap may reflect training priorities that favor language alignment over precise visual feature extraction.
  • Similar limitations are likely to appear in related areas such as video understanding and spatial reasoning.
  • Explicit perceptual modules drawn from traditional computer vision could be combined with LLMs to close the gap more quickly than scaling alone.

Load-bearing premise

The selected tasks genuinely require visual perception that cannot be solved through language patterns or statistical shortcuts in the training data.

What would settle it

A multimodal LLM that reaches 90% or higher accuracy on the full BLINK set after targeted training or architectural changes, without relying on external CV modules, would falsify the claim.

read the original abstract

We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BLINK, a benchmark of 3,807 multiple-choice questions reformatted from 14 classic computer vision tasks (e.g., relative depth estimation, visual correspondence, forensics detection) that are claimed to require core visual perception abilities resistant to natural-language mediation. Humans achieve 95.70% accuracy on average, while leading MLLMs such as GPT-4V and Gemini reach only 51.26% and 45.72% (13.17% and 7.63% above random), and specialist CV models perform substantially better.

Significance. If the tasks genuinely isolate visual perception, the work would be significant by documenting a large, reproducible gap between human and current MLLM performance on perception-heavy CV problems and by supplying a concrete benchmark that could guide integration of specialist CV techniques into multimodal models.

major comments (2)
  1. [Benchmark Design] The headline claim that perception abilities 'have not emerged' rests on the assertion that the 14 tasks 'resist mediation through natural language,' yet the manuscript reports no text-only baselines, blank-image ablations, or option-bias controls. Without these, low MLLM accuracies could arise from prompt sensitivity or memorized correlations rather than absent visual perception (see skeptic note and abstract).
  2. [Results] Table or results section: the exact random-guessing baseline is not derived or tabulated per task (e.g., 4-option vs. 2-option questions), making the stated margins of 13.17% and 7.63% above chance impossible to verify independently.
minor comments (2)
  1. [Methods] Add a short paragraph in the methods describing image sourcing, exclusion criteria, and inter-annotator agreement for the 3,807 questions.
  2. [Figures] Figure captions should explicitly state the number of images per question and the prompting format used for each task.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing the BLINK benchmark. We address each of the major comments point by point below. We agree with the need for additional controls and baselines to strengthen our claims and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Benchmark Design] The headline claim that perception abilities 'have not emerged' rests on the assertion that the 14 tasks 'resist mediation through natural language,' yet the manuscript reports no text-only baselines, blank-image ablations, or option-bias controls. Without these, low MLLM accuracies could arise from prompt sensitivity or memorized correlations rather than absent visual perception (see skeptic note and abstract).

    Authors: We recognize the importance of these controls to substantiate that the tasks primarily require visual perception rather than language-based reasoning. In the revised version, we will include text-only baselines by evaluating the MLLMs on the textual questions without any images. We will also perform blank-image ablations and analyze option biases by shuffling choices. These additions will allow us to better isolate the visual component and address potential confounds. revision: yes

  2. Referee: [Results] Table or results section: the exact random-guessing baseline is not derived or tabulated per task (e.g., 4-option vs. 2-option questions), making the stated margins of 13.17% and 7.63% above chance impossible to verify independently.

    Authors: We apologize for the omission and agree that providing per-task random baselines is essential for independent verification. We will update the results section and tables to explicitly list the number of choices for each task and compute the corresponding random accuracy (25% for four options, 50% for two, etc.). The reported margins above random will be recalculated based on the weighted average across tasks. revision: yes
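For readers who want to sanity-check the margins, a minimal sketch of the question-weighted random baseline described above; the task names, question counts, and option counts are placeholders rather than the actual BLINK statistics.

```python
# Sketch of the per-task random-guessing baseline requested by the referee.
# Task names, question counts, and option counts below are placeholders;
# the real values would come from the BLINK per-task tables.
tasks = {
    # task: (number of questions, number of answer options)
    "relative_depth": (124, 2),
    "multi_view_reasoning": (133, 2),
    "visual_correspondence": (343, 4),
    "forensics_detection": (500, 4),
}

def weighted_random_baseline(tasks: dict) -> float:
    """Question-weighted average of per-task chance accuracy (1 / number of options)."""
    total = sum(n for n, _ in tasks.values())
    return sum(n / k for n, k in tasks.values()) / total

print(f"random baseline: {weighted_random_baseline(tasks):.2%}")

# The abstract's own numbers imply a benchmark-wide baseline near 38%
# (51.26% - 13.17% and 45.72% - 7.63% both give 38.09%), which a full
# 14-task table of this form should reproduce.
```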

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with independent controls

full rationale

The paper introduces BLINK as a new benchmark by reformatting 14 existing CV tasks into multiple-choice questions and reports model accuracies against human performance and random guessing baselines. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text. The central claim rests on direct empirical comparison on held-out questions, with no reduction of results to inputs by construction. Specialist CV model comparisons further support external grounding rather than internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations. It rests on the domain assumption that the chosen tasks isolate visual perception independent of language.

axioms (1)
  • domain assumption: The selected computer vision tasks measure core perceptual abilities that cannot be solved through language mediation alone
    Invoked in the abstract when stating that the tasks resist mediation through natural language.

pith-pipeline@v0.9.0 · 5545 in / 1159 out tokens · 19462 ms · 2026-05-15T20:14:26.616449+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

    cs.CV 2026-05 unverdicted novelty 7.0

    MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.

  3. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  4. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  5. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  6. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  7. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  8. Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

  9. RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

    cs.LG 2026-04 unverdicted novelty 6.0

    RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...

  10. Multimodal Language Models Cannot Spot Spatial Inconsistencies

    cs.CV 2026-04 unverdicted novelty 6.0

    Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. Depth Anything V2

    cs.CV 2024-06 unverdicted novelty 6.0

    Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.

  16. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  17. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  18. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  19. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 18 Pith papers · 16 internal anchors
