{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:OS4GEAZFRAHNUBG7PVAC4WBW5T","short_pith_number":"pith:OS4GEAZF","schema_version":"1.0","canonical_sha256":"74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7","source":{"kind":"arxiv","id":"2501.00321","version":2},"attestation_state":"computed","paper":{"title":"OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Biao Yang, Binghong Wu, Bin Shan, Can Huang, Chunhui Lin, Guozhi Tang, Hao Feng, Hao Liu, Hao Lu, Jiajun Song, Jingqun Tang, Lianwen Jin, Ling Fu, Linghao Zhu, Mingxin Huang, Qidi Luo, Qi Liu, Wei Chen, Xiang Bai, Xinyu Wang, Yuliang Liu, Yuzhe Li, Zhang Li, Zhebin Kuang","submitted_at":"2024-12-31T07:32:35Z","abstract_excerpt":"Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. 
To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2501.00321","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-12-31T07:32:35Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"784fe27428b4eab38602e77a2b9c56620512c96b39067546e703d274c050939e","abstract_canon_sha256":"011b8e29481deb917e48675e59fec8ebd7a34fa2db59936c61cb7fcc55a6ccc9"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.153454Z","signature_b64":"bLuOLG2VLlpEnLZWv1mEbTfGMPNv5afgKtgc0wV1lCB9yxTw2TknC9k+/hAheXHMPhaSVln3rzHIsZCRaaQECg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"74b8620325880eda04df7d402e5836ecf84b4e03d4c6959096231f39559c4cf7","last_reissued_at":"2026-05-17T23:38:13.152917Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.152917Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A new benchmark shows most large multimodal models score below 50 out of 100 on visual text 
tasks.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Biao Yang, Binghong Wu, Bin Shan, Can Huang, Chunhui Lin, Guozhi Tang, Hao Feng, Hao Liu, Hao Lu, Jiajun Song, Jingqun Tang, Lianwen Jin, Ling Fu, Linghao Zhu, Mingxin Huang, Qidi Luo, Qi Liu, Wei Chen, Xiang Bai, Xinyu Wang, Yuliang Liu, Yuzhe Li, Zhang Li, Zhebin Kuang","submitted_at":"2024-12-31T07:32:35Z","abstract_excerpt":"Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen 31 scenarios and 10,000 human-verified question-answer pairs, together with the private test set, provide an unbiased and comprehensive measure of the five claimed limitations without selection effects that favor certain model failure modes.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals 
most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"794728bcf6f4de215cca85d6e91e04b1b56ba6132f5d7c118ce2a58999697f0b"},"source":{"id":"2501.00321","kind":"arxiv","version":2},"verdict":{"id":"8fd6154c-8f6f-4e5c-9db5-bec269d11583","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:29:10.364789Z","strongest_claim":"After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning.","one_line_summary":"OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen 31 scenarios and 10,000 human-verified question-answer pairs, together with the private test set, provide an unbiased and comprehensive measure of the five claimed limitations without selection effects that favor certain model failure modes.","pith_extraction_headline":"A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks."},"references":{"count":156,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2023,"title":"LLaMA: Open and Efficient 
Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","ref_index":2,"cited_arxiv_id":"2302.13971","is_internal_anchor":true},{"doi":"","year":2020,"title":"Language models are few-shot learners","work_id":"435448fd-e4c0-479b-9769-6a7e11a7a63d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":4,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2024,"title":"H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, 2024","work_id":"d9185768-bb71-4cd6-8b2d-3856483d4c46","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":156,"snapshot_sha256":"067a6d04411e59416800099a0e42b23bbdc2a363570a3e38a136016b29949d7a","internal_anchors":29},"formal_canon":{"evidence_count":2,"snapshot_sha256":"83b71098e338777915b40ba65864470c49ade0f498810861ef331e7d922b8166"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2501.00321","created_at":"2026-05-17T23:38:13.153008+00:00"},{"alias_kind":"arxiv_version","alias_value":"2501.00321v2","created_at":"2026-05-17T23:38:13.153008+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2501.00321","created_at":"2026-05-17T23:38:13.153008+00:00"},{"alias_kind":"pith_short_12","alias_value":"OS4GEAZFRAHN","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"OS4GEAZFRAHNUBG7","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"OS4GEAZF","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inboun
d_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2605.20278","citing_title":"ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18173","citing_title":"Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19929","citing_title":"Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2511.14998","citing_title":"FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22186","citing_title":"MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2603.19790","citing_title":"From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03339","citing_title":"Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11960","citing_title":"Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12500","citing_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11301","citing_title":"LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11462","citing_title":"SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D 
Images","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03903","citing_title":"CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00885","citing_title":"Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19858","citing_title":"Wan-Image: Pushing the Boundaries of Generative Visual Intelligence","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08538","citing_title":"ParseBench: A Document Parsing Benchmark for AI Agents","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07492","citing_title":"How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04733","citing_title":"Discovering Failure Modes in Vision-Language Models using RL","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04411","citing_title":"Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19259","citing_title":"Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25359","citing_title":"The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language 
Models","ref_index":22,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T","json":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T.json","graph_json":"https://pith.science/api/pith-number/OS4GEAZFRAHNUBG7PVAC4WBW5T/graph.json","events_json":"https://pith.science/api/pith-number/OS4GEAZFRAHNUBG7PVAC4WBW5T/events.json","paper":"https://pith.science/paper/OS4GEAZF"},"agent_actions":{"view_html":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T","download_json":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T.json","view_paper":"https://pith.science/paper/OS4GEAZF","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2501.00321&json=true","fetch_graph":"https://pith.science/api/pith-number/OS4GEAZFRAHNUBG7PVAC4WBW5T/graph.json","fetch_events":"https://pith.science/api/pith-number/OS4GEAZFRAHNUBG7PVAC4WBW5T/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T/action/timestamp_anchor","attest_storage":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T/action/storage_attestation","attest_author":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T/action/author_attestation","sign_citation":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T/action/citation_signature","submit_replication":"https://pith.science/pith/OS4GEAZFRAHNUBG7PVAC4WBW5T/action/replication_record"}},"created_at":"2026-05-17T23:38:13.153008+00:00","updated_at":"2026-05-17T23:38:13.153008+00:00"}