Pith Number

pith:OS4GEAZF

pith:2024:OS4GEAZFRAHNUBG7PVAC4WBW5T

not attested · not anchored · not stored · refs resolved

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Biao Yang, Binghong Wu, Bin Shan, Can Huang, Chunhui Lin, Guozhi Tang, Hao Feng, Hao Liu, Hao Lu, Jiajun Song, Jingqun Tang, Lianwen Jin, Ling Fu, Linghao Zhu, Mingxin Huang, Qidi Luo, Qi Liu, Wei Chen, Xiang Bai, Xinyu Wang, Yuliang Liu, Yuzhe Li, Zhang Li, Zhebin Kuang

A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks.

arxiv:2501.00321 v2 · 2024-12-31 · cs.CV · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{OS4GEAZFRAHNUBG7PVAC4WBW5T}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
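The page does not spell out the merge algorithm, so the following is only a minimal sketch of what "deterministic" can mean here: the result depends on the set of events, not on the order in which a mirror received them. The event fields (`id`, `kind`) and the function `merge_state` are hypothetical names chosen for illustration, and signature verification is omitted.

```python
from typing import Any


def merge_state(canonical_record: dict[str, Any],
                events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a state in an order-independent way.

    Assumption (not from the source): every event carries a unique,
    content-derived "id", so duplicates with the same id are identical.
    """
    # Deduplicate by id so a mirror that received an event twice agrees
    # with one that received it once.
    unique = {event["id"]: event for event in events}

    state: dict[str, Any] = {"record": canonical_record, "applied": []}
    # Sorting by id fixes a single total order regardless of arrival order.
    for event_id in sorted(unique):
        state["applied"].append(unique[event_id]["kind"])
    return state


# Hypothetical toy data, not taken from a real bundle.
record = {"source": {"kind": "arxiv", "id": "2501.00321", "version": 2}}
events = [
    {"id": "e2", "kind": "citation"},
    {"id": "e1", "kind": "refs_resolved"},
    {"id": "e2", "kind": "citation"},  # duplicate delivery
]

state_a = merge_state(record, events)
state_b = merge_state(record, list(reversed(events)))
print(state_a == state_b)
```

Sorting on a stable key before folding is one common way to get this property; the actual algorithm may order or resolve events differently.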

Claims

C1 · strongest claim

After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning.

C2 · weakest assumption

That the chosen 31 scenarios and 10,000 human-verified question-answer pairs, together with the private test set, provide an unbiased and comprehensive measure of the five claimed limitations without selection effects that favor certain model failure modes.

C3 · one-line summary

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

References

156 extracted · 156 resolved · 29 Pith anchors

[1] GPT-4 Technical Report 2023 · arXiv:2303.08774

[2] LLaMA: Open and Efficient Foundation Language Models 2023 · arXiv:2302.13971

[3] Language Models are Few-Shot Learners 2020

[4] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966

[5] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, 2024

Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Receipt and verification

First computed	2026-05-17T23:38:13.152917Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7

Aliases

arxiv: 2501.00321 · arxiv_version: 2501.00321v2 · doi: 10.48550/arxiv.2501.00321 · pith_short_12: OS4GEAZFRAHN · pith_short_16: OS4GEAZFRAHNUBG7 · pith_short_8: OS4GEAZF
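For this record, the identifier and its short aliases line up with the canonical hash: Base32-encoding the SHA-256 digest (RFC 4648 alphabet) and taking the first 26 characters reproduces the Pith Number, and the `pith_short_*` aliases are prefixes of it. The page does not state this derivation rule, so treat the sketch below as an observation about this one record rather than a specification; `base32_prefix` is a hypothetical helper name.

```python
import base64

# Canonical hash as shown on this page.
CANONICAL_HASH = "74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode a hex digest and return its first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full_id = base32_prefix(CANONICAL_HASH, 26)
short_ids = {n: full_id[:n] for n in (8, 12, 16)}

print(full_id)        # OS4GEAZFRAHNUBG7PVAC4WBW5T
print(short_ids[8])   # OS4GEAZF
print(short_ids[12])  # OS4GEAZFRAHN
print(short_ids[16])  # OS4GEAZFRAHNUBG7
```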

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

# Fetch the record, extract the canonical record, re-serialize it canonically
# (sorted keys, compact separators, UTF-8), and print its SHA-256 digest.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7
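The Python one-liner above packs the canonicalization step into a single expression. As a more readable sketch of the same step, the function below serializes a record with sorted keys, compact separators, and no ASCII escaping, then hashes the UTF-8 bytes. The name `canonical_sha256` and the toy records are illustrative assumptions; the toy records are not the full canonical record, so their digest is not expected to match the canonical hash on this page.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a JSON object using the same settings as the one-liner:
    sorted keys, compact separators, ensure_ascii=False, UTF-8 bytes."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two toy records with identical content but different key order.
rec_a = {"source": {"kind": "arxiv", "id": "2501.00321", "version": 2},
         "schema_version": "1.0"}
rec_b = {"schema_version": "1.0",
         "source": {"version": 2, "id": "2501.00321", "kind": "arxiv"}}

# Key order does not affect the digest because keys are sorted first.
print(canonical_sha256(rec_a) == canonical_sha256(rec_b))
```

Sorting keys and fixing the separators is what makes the digest reproducible across JSON producers that emit fields in different orders or with different whitespace.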

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "011b8e29481deb917e48675e59fec8ebd7a34fa2db59936c61cb7fcc55a6ccc9",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-12-31T07:32:35Z",
    "title_canon_sha256": "784fe27428b4eab38602e77a2b9c56620512c96b39067546e703d274c050939e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.00321",
    "kind": "arxiv",
    "version": 2
  }
}