{"paper":{"title":"OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"OCRBench evaluates large multimodal models on 29 OCR datasets to expose their specific weaknesses in text recognition tasks.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Biao Yang, Cheng-Lin Liu, Chunyuan Li, Lianwen Jin, Mingxin Huang, Wenwen Yu, Xiang Bai, Xucheng Yin, Yuliang Liu, Zhang Li","submitted_at":"2023-05-13T11:28:37Z","abstract_excerpt":"Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Charac"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 29 chosen datasets together form a representative and non-redundant sample of all text-related visual challenges that large multimodal models will encounter in practice.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"OCRBench evaluates large multimodal models on 29 OCR datasets to expose their specific weaknesses in text recognition tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"428e0c0f10e965bedfa7a01d3926d3ce84a3be171f2bc2f6ae97c2cd344153c6"},"source":{"id":"2305.07895","kind":"arxiv","version":7},"verdict":{"id":"becea05d-6d4a-4be0-8f09-e1f7e34325c7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T09:49:42.939904Z","strongest_claim":"To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.","one_line_summary":"OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 29 chosen datasets together form a representative and non-redundant sample of all text-related visual challenges that large multimodal models will encounter in practice.","pith_extraction_headline":"OCRBench evaluates large multimodal models on 29 OCR datasets to expose their specific weaknesses in text recognition tasks."},"references":{"count":122,"sample":[{"doi":"","year":2023,"title":"OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023","work_id":"07eb0a06-5091-41c0-b751-d00d8cff832a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Gpt-4 technical report","work_id":"388f534c-855a-4366-b933-f07bf3e2db5f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","ref_index":3,"cited_arxiv_id":"2302.13971","is_internal_anchor":true},{"doi":"","year":2023,"title":"Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.co","work_id":"bf3517b5-0f2c-4f46-bff1-8d74220ebc3f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality","work_id":"67dc94e1-9c8e-4287-ae6c-979bce9614cf","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":122,"snapshot_sha256":"b48d398871d3ff6d4ec90f8f42409ca8255fc90f9ac0fe3016784e9780d9745e","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"63fd31f74d6ea747e55fce5990745bf740d2afbe75834ef3379c0e8dd0753a60"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}