Pith Number

pith:OS4GEAZF

pith:2024:OS4GEAZFRAHNUBG7PVAC4WBW5T
not attested · not anchored · not stored · refs resolved

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Biao Yang, Binghong Wu, Bin Shan, Can Huang, Chunhui Lin, Guozhi Tang, Hao Feng, Hao Liu, Hao Lu, Jiajun Song, Jingqun Tang, Lianwen Jin, Ling Fu, Linghao Zhu, Mingxin Huang, Qidi Luo, Qi Liu, Wei Chen, Xiang Bai, Xinyu Wang, Yuliang Liu, Yuzhe Li, Zhang Li, Zhebin Kuang

A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks.

arxiv:2501.00321 v2 · 2024-12-31 · cs.CV · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{OS4GEAZFRAHNUBG7PVAC4WBW5T}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
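
The page does not describe the merge algorithm, so the following is only a minimal sketch of what a deterministic, order-independent merge can look like. The event fields (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are assumptions made for illustration, not the Pith bundle format.

```python
from typing import Any


def merge_events(*event_lists: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events from several mirrors into one state, whatever order they arrive in."""
    # Deduplicate by the (hypothetical) event id so a replicated event counts once.
    unique: dict[str, dict[str, Any]] = {}
    for events in event_lists:
        for event in events:
            unique[event["id"]] = event

    # Impose a total order: timestamp first, event id as the tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))

    # Last writer wins per field (an assumed rule, chosen for simplicity).
    state: dict[str, Any] = {}
    for event in ordered:
        state[event["field"]] = event["value"]
    return state


e1 = {"id": "e1", "ts": 1, "field": "author_claim", "value": "open"}
e2 = {"id": "e2", "ts": 2, "field": "author_claim", "value": "claimed"}
e3 = {"id": "e3", "ts": 2, "field": "citations", "value": "open"}

mirror_a = [e1, e2]
mirror_b = [e3, e2, e1]

state_ab = merge_events(mirror_a, mirror_b)
state_ba = merge_events(mirror_b, mirror_a)
print(state_ab == state_ba)
```

Because the events are deduplicated and sorted before folding, two mirrors holding the same events in different orders reach the same state.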

Claims

C1 · strongest claim

After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning.

C2 · weakest assumption

That the chosen 31 scenarios and 10,000 human-verified question-answer pairs, together with the private test set, provide an unbiased and comprehensive measure of the five claimed limitations without selection effects that favor certain model failure modes.

C3 · one line summary

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

References

156 extracted · 156 resolved · 29 Pith anchors

[1] GPT-4 Technical Report 2023 · arXiv:2303.08774
[2] LLaMA: Open and Efficient Foundation Language Models 2023 · arXiv:2302.13971
[3] Language models are few-shot learners 2020
[4] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966
[5] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, 2024

Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:13.152917Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7

Aliases

arxiv: 2501.00321 · arxiv_version: 2501.00321v2 · doi: 10.48550/arxiv.2501.00321 · pith_short_12: OS4GEAZFRAHN · pith_short_16: OS4GEAZFRAHNUBG7 · pith_short_8: OS4GEAZF
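
The short aliases listed here are prefixes of the full Pith Number. For this record, the full number also coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The sketch below checks both observations; whether the builder defines the identifier this way is an assumption, and the helper names are hypothetical.

```python
import base64

PITH_NUMBER = "OS4GEAZFRAHNUBG7PVAC4WBW5T"
CANONICAL_HASH = "74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7"


def short_alias(pith_number: str, length: int) -> str:
    """Return the first `length` characters of a Pith Number."""
    return pith_number[:length]


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest and keep the first `length` characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


for n in (8, 12, 16):
    print(f"pith_short_{n}: {short_alias(PITH_NUMBER, n)}")

# Observed relationship for this record only; not a documented rule.
print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
```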
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "011b8e29481deb917e48675e59fec8ebd7a34fa2db59936c61cb7fcc55a6ccc9",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-12-31T07:32:35Z",
    "title_canon_sha256": "784fe27428b4eab38602e77a2b9c56620512c96b39067546e703d274c050939e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.00321",
    "kind": "arxiv",
    "version": 2
  }
}
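
The verification command canonicalizes the record as JSON with sorted keys and compact separators, then takes the SHA-256 of the UTF-8 bytes. The same steps can be run offline against the record printed above, as in this sketch. It assumes the JSON shown on this page is the complete `canonical_record`; the function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

EXPECTED = "74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7"

RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "011b8e29481deb917e48675e59fec8ebd7a34fa2db59936c61cb7fcc55a6ccc9",
    "cross_cats_sorted": ["cs.AI"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-12-31T07:32:35Z",
    "title_canon_sha256": "784fe27428b4eab38602e77a2b9c56620512c96b39067546e703d274c050939e"
  },
  "schema_version": "1.0",
  "source": {"id": "2501.00321", "kind": "arxiv", "version": 2}
}
"""


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the shell one-liner does."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the hash independent of how the JSON was pretty-printed, which is why the indented record on this page and the compact `jq -c` output hash identically.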