Pith Number

pith:VIEYAOND

pith:2024:VIEYAONDM5LUTKLMMNKATGG3N7
not attested · not anchored · not stored · refs resolved

BLINK: Multimodal Large Language Models Can See but Not Perceive

Bangzheng Li, Dan Roth, Haoyu Wang, Noah A. Smith, Ranjay Krishna, Wei-Chiu Ma, Xingyu Fu, Xudong Lin, Yu Feng, Yushi Hu

Multimodal LLMs like GPT-4V reach only 51% accuracy on visual perception tasks that humans solve with 96% accuracy.

arxiv:2404.12390 v4 · 2024-04-18 · cs.CV · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VIEYAONDM5LUTKLMMNKATGG3N7}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not emerged yet in recent multimodal LLMs
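
The two margins quoted in this claim can be cross-checked against each other: subtracting each margin from its accuracy should give the same random-guessing baseline. A minimal sketch of that arithmetic, using only the four figures quoted above (the variable names are illustrative):

```python
from decimal import Decimal

# Figures quoted in claim C1 (percent accuracy and margin over random guessing).
gpt4v_acc, gpt4v_margin = Decimal("51.26"), Decimal("13.17")
gemini_acc, gemini_margin = Decimal("45.72"), Decimal("7.63")

# Implied random-guessing baseline, derived independently from each model.
baseline_from_gpt4v = gpt4v_acc - gpt4v_margin
baseline_from_gemini = gemini_acc - gemini_margin

print(baseline_from_gpt4v)   # 38.09
print(baseline_from_gemini)  # 38.09
print(baseline_from_gpt4v == baseline_from_gemini)  # True
```

Both subtractions give 38.09%, so the quoted accuracies and margins are internally consistent.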

C2 · weakest assumption

That the selected tasks genuinely require visual perception that cannot be solved through language patterns or statistical shortcuts in the training data.

C3 · one-line summary

The BLINK benchmark shows that the best multimodal LLMs reach only about 46 to 51 percent accuracy on core visual perception tasks where humans achieve about 96 percent, indicating that these abilities have not yet emerged.

References

90 extracted · 90 resolved · 20 Pith anchors

[1] Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family (March 2024)
[2] In: AAAI (2019)
[3] Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
[4] In: Proceedings of the IEEE International Conference on Computer Vision (2015)
[5] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models (2023) · arXiv:2308.01390

Formal links

2 machine-checked theorem links

Cited by

33 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:50.297986Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

aa098039a3675749a96c63540998db6fc6907ba0875170782140cef6079be0de

Aliases

arxiv: 2404.12390 · arxiv_version: 2404.12390v4 · doi: 10.48550/arxiv.2404.12390 · pith_short_12: VIEYAONDM5LU · pith_short_16: VIEYAONDM5LUTKLM · pith_short_8: VIEYAOND
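
The short aliases above are prefixes of the full Pith Number, and the number itself appears to be the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That relationship is inferred from the values on this page, not from a stated specification, so the sketch below is a consistency check and not the official derivation. The helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "aa098039a3675749a96c63540998db6fc6907ba0875170782140cef6079be0de"
PITH_NUMBER = "VIEYAONDM5LUTKLMMNKATGG3N7"
ALIASES = {8: "VIEYAOND", 12: "VIEYAONDM5LU", 16: "VIEYAONDM5LUTKLM"}


def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier relates to the hash; the page
    itself does not document the encoding.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))
print("number matches hash prefix:", derived == PITH_NUMBER)

# Each pith_short_N alias should equal the first N characters of the number.
for n, alias in sorted(ALIASES.items()):
    print(f"pith_short_{n} is a prefix:", PITH_NUMBER[:n] == alias)
```

If the assumption holds, any alias can be checked offline against the canonical hash without contacting the registry.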
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VIEYAONDM5LUTKLMMNKATGG3N7 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: aa098039a3675749a96c63540998db6fc6907ba0875170782140cef6079be0de
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "dd25bcb3e35202474023a787b0b9d122840766b9a54178a832f88e9f180d9e66",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-04-18T17:59:54Z",
    "title_canon_sha256": "4d8fd9e1fea6457fae3bc1f04cdd373d055d3fb0b8cdf6f80054724814cfc882"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2404.12390",
    "kind": "arxiv",
    "version": 4
  }
}
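
The shell recipe under "Verify this Pith Number yourself" can also be run offline against the canonical record shown above. A minimal sketch, assuming the record printed on this page is byte-for-byte what the API returns; it applies the same serialization as the one-liner (sorted keys, compact separators, non-ASCII preserved) and compares the result with the canonical hash. The function name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# Canonical record copied from this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "dd25bcb3e35202474023a787b0b9d122840766b9a54178a832f88e9f180d9e66",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-04-18T17:59:54Z",
        "title_canon_sha256": "4d8fd9e1fea6457fae3bc1f04cdd373d055d3fb0b8cdf6f80054724814cfc882",
    },
    "schema_version": "1.0",
    "source": {"id": "2404.12390", "kind": "arxiv", "version": 4},
}

# Canonical hash as listed on this page.
EXPECTED_HASH = "aa098039a3675749a96c63540998db6fc6907ba0875170782140cef6079be0de"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(CANONICAL_RECORD)
print(digest)
print("matches page hash:", digest == EXPECTED_HASH)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON keys were written, which is what lets a mirror recompute the same value.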