BLINK: Multimodal Large Language Models Can See but Not Perceive
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 20:14 UTC · model grok-4.3
The pith
Multimodal LLMs such as GPT-4V reach only 51.26% accuracy on visual perception tasks that humans solve at 95.70%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multimodal large language models can process images but lack core visual perception abilities; on the BLINK benchmark they achieve at most 51.26% accuracy, only 13 percentage points above random guessing, whereas humans reach 95.70%.
What carries the argument
The BLINK benchmark, which reformats classic computer vision tasks into image-paired multiple-choice questions that cannot be solved reliably through language patterns alone.
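As a rough illustration of what such a reformatted item might look like, here is a minimal Python sketch; the schema and field names are hypothetical, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlinkItem:
    """One BLINK-style multiple-choice question (hypothetical schema)."""
    task: str               # e.g. "relative_depth" or "visual_correspondence"
    image_paths: List[str]  # one or more images shown with the question
    question: str           # natural-language prompt, possibly with visual markers on the images
    choices: List[str]      # answer options, e.g. ["(A) point A", "(B) point B"]
    answer: str             # gold label, e.g. "(A)"

# A two-option relative-depth item of the kind humans answer "within a blink".
item = BlinkItem(
    task="relative_depth",
    image_paths=["scene_001.jpg"],
    question="Which marked point is closer to the camera?",
    choices=["(A) point A", "(B) point B"],
    answer="(A)",
)
```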
If this is right
- Multimodal LLMs will need mechanisms beyond current scaling to acquire reliable visual perception.
- Integration with specialist computer vision models offers a direct route to higher accuracy on these tasks.
- Future training objectives should prioritize tasks that resist solution by text-only statistical patterns.
- Benchmarks focused on perception will be required to measure progress toward human-level visual understanding.
Where Pith is reading between the lines
- The gap may reflect training priorities that favor language alignment over precise visual feature extraction.
- Similar limitations are likely to appear in related areas such as video understanding and spatial reasoning.
- Explicit perceptual modules drawn from traditional computer vision could be combined with LLMs to close the gap more quickly than scaling alone.
Load-bearing premise
The selected tasks genuinely require visual perception that cannot be solved through language patterns or statistical shortcuts in the training data.
What would settle it
A multimodal LLM that reaches 90% or higher accuracy on the full BLINK set after targeted training or architectural changes, without relying on external CV modules, would falsify the claim.
read the original abstract
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BLINK, a benchmark of 3,807 multiple-choice questions reformatted from 14 classic computer vision tasks (e.g., relative depth estimation, visual correspondence, forensics detection) that are claimed to require core visual perception abilities resistant to natural-language mediation. Humans achieve 95.70% accuracy on average, while leading MLLMs such as GPT-4V and Gemini reach only 51.26% and 45.72% (13.17% and 7.63% above random), and specialist CV models perform substantially better.
Significance. If the tasks genuinely isolate visual perception, the work would be significant by documenting a large, reproducible gap between human and current MLLM performance on perception-heavy CV problems and by supplying a concrete benchmark that could guide integration of specialist CV techniques into multimodal models.
major comments (2)
- [Benchmark Design] The headline claim that perception abilities 'have not emerged' rests on the assertion that the 14 tasks 'resist mediation through natural language,' yet the manuscript reports no text-only baselines, blank-image ablations, or option-bias controls. Without these, low MLLM accuracies could arise from prompt sensitivity or memorized correlations rather than absent visual perception (see skeptic note and abstract).
- [Results] Table or results section: the exact random-guessing baseline is not derived or tabulated per task (e.g., 4-option vs. 2-option questions), making the stated margins of 13.17% and 7.63% above chance impossible to verify independently.
minor comments (2)
- [Methods] Add a short paragraph in the methods describing image sourcing, exclusion criteria, and inter-annotator agreement for the 3,807 questions.
- [Figures] Figure captions should explicitly state the number of images per question and the prompting format used for each task.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript introducing the BLINK benchmark. We address each of the major comments point by point below. We agree with the need for additional controls and baselines to strengthen our claims and will incorporate revisions accordingly.
read point-by-point responses
- Referee: [Benchmark Design] The headline claim that perception abilities 'have not emerged' rests on the assertion that the 14 tasks 'resist mediation through natural language,' yet the manuscript reports no text-only baselines, blank-image ablations, or option-bias controls. Without these, low MLLM accuracies could arise from prompt sensitivity or memorized correlations rather than absent visual perception (see skeptic note and abstract).
Authors: We recognize the importance of these controls to substantiate that the tasks primarily require visual perception rather than language-based reasoning. In the revised version, we will include text-only baselines by evaluating the MLLMs on the textual questions without any images. We will also perform blank-image ablations and analyze option biases by shuffling choices. These additions will allow us to better isolate the visual component and address potential confounds. Revision planned: yes.
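A minimal sketch of how such controls could be scored, reusing the hypothetical BlinkItem schema sketched earlier and assuming a generic model(images, prompt) wrapper that neither the paper nor the rebuttal specifies:

```python
from typing import Callable, List, Optional, Sequence

def ablation_accuracy(model: Callable[[Optional[Sequence[str]], str], str],
                      items: List["BlinkItem"],
                      mode: str = "full") -> float:
    """Score one condition: full (images shown), text_only (no images),
    or blank (uninformative placeholder images)."""
    correct = 0
    for it in items:
        if mode == "full":          # normal evaluation: images shown
            images: Optional[Sequence[str]] = it.image_paths
        elif mode == "text_only":   # text-only baseline: no images at all
            images = None
        elif mode == "blank":       # blank-image ablation: uninformative placeholder
            images = ["blank_white.png"] * len(it.image_paths)
        else:
            raise ValueError(f"unknown mode: {mode}")
        prompt = (f"{it.question}\nOptions: {' '.join(it.choices)}\n"
                  "Answer with the option letter only.")
        prediction = model(images, prompt)  # hypothetical MLLM call
        correct += int(prediction.strip().startswith(it.answer))
    return correct / len(items)

# If text_only or blank accuracy approaches the full-image score, the item is
# being solved from language priors or option bias rather than visual perception.
```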
- Referee: [Results] Table or results section: the exact random-guessing baseline is not derived or tabulated per task (e.g., 4-option vs. 2-option questions), making the stated margins of 13.17% and 7.63% above chance impossible to verify independently.
Authors: We apologize for the omission and agree that providing per-task random baselines is essential for independent verification. We will update the results section and tables to explicitly list the number of choices for each task and compute the corresponding random accuracy (25% for four options, 50% for two, etc.). The reported margins above random will be recalculated as the question-weighted average across tasks. Revision planned: yes.
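A worked sketch of the requested calculation: the chance-level baseline is the question-weighted average of each task's random accuracy. Task names below come from the abstract, but the question counts are illustrative placeholders, not BLINK's real per-task totals.

```python
# Each entry: (number of questions, number of answer options). Counts are placeholders.
tasks = {
    "relative_depth":        (120, 2),  # 2 options -> 50% by chance
    "visual_correspondence": (300, 4),  # 4 options -> 25% by chance
    "forensics_detection":   (250, 4),
    "multi_view_reasoning":  (130, 2),
}

total = sum(n for n, _ in tasks.values())
weighted_random = sum(n / k for n, k in tasks.values()) / total
print(f"Weighted random baseline: {weighted_random:.2%}")

# The paper's own margins imply a weighted chance level of about
# 51.26% - 13.17% = 38.09% (and consistently 45.72% - 7.63% = 38.09%),
# i.e. a mix of question formats averaging more than two options each.
```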
Circularity Check
No circularity: empirical benchmark evaluation with independent controls
full rationale
The paper introduces BLINK as a new benchmark by reformatting 14 existing CV tasks into multiple-choice questions and reports model accuracies against human performance and random guessing baselines. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text. The central claim rests on direct empirical comparison on held-out questions, with no reduction of results to inputs by construction. Specialist CV model comparisons further support external grounding rather than internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The selected computer vision tasks measure core perceptual abilities that cannot be solved through language mediation alone.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/LogicAsFunctionalEquation · laws_of_logic_imply_dalembert_hypotheses (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Linked passage: "these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
- The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space. MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.
- Improving Vision-language Models with Perception-centric Process Reward Models. Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
- EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training. EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
- Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs. Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
- RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction. RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...
- Multimodal Language Models Cannot Spot Spatial Inconsistencies. Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- Depth Anything V2. Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
- Kimi K2.5: Visual Agentic Intelligence. Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs. Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- LLaVA-OneVision: Easy Visual Task Transfer. LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.