{"paper":{"title":"General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chenglong Liu, Chunrui Han, Haoran Wei, Jianjian Sun, Jia Wang, Jinyue Chen, Liang Zhao, Lingyu Kong, Xiangyu Zhang, Yanming Xu, Yuang Peng, Zheng Ge","submitted_at":"2024-09-03T08:41:31Z","abstract_excerpt":"Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as \"characters\" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-conte"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as 'characters' and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT ... can handle all the above 'characters' under various OCR tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a single 580M-parameter end-to-end model with prompt-based output formatting can maintain high accuracy across all listed character types and input styles without requiring task-specific components or suffering from interference between them.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"14e91852765b7f7f7589a219f44fce9aa8e0a87d822d1a1bb71d52661b85a111"},"source":{"id":"2409.01704","kind":"arxiv","version":1},"verdict":{"id":"ee74def5-05bf-4541-8ba8-fe34908683e9","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:46:27.603037Z","strongest_claim":"we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as 'characters' and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT ... can handle all the above 'characters' under various OCR tasks.","one_line_summary":"GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a single 580M-parameter end-to-end model with prompt-based output formatting can maintain high accuracy across all listed character types and input styles without requiring task-specific components or suffering from interference between them.","pith_extraction_headline":"A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters."},"references":{"count":55,"sample":[{"doi":"","year":2024,"title":"https://huggingface.co/datasets/Teklia/CASIA-HWDB2-line (2024) 6","work_id":"efa2f0aa-94bb-4f8a-be9a-ef30d147d703","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"https://huggingface.co/datasets/Teklia/IAM-line (2024) 6","work_id":"bade56df-693e-494c-bd89-644ff64c339d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"https://huggingface.co/datasets/Teklia/NorHand-v3-line (2024) 6","work_id":"e47f2fad-800e-48cd-a291-6eaddb1fe6fd","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","ref_index":4,"cited_arxiv_id":"2309.16609","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":5,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":55,"snapshot_sha256":"327eaeb634c21159b954b31fe4a7a3805485d28f72c26f0c4584d75a4bffecbb","internal_anchors":10},"formal_canon":{"evidence_count":2,"snapshot_sha256":"95a8ba08337e34da427262fdeb3c674da7b763217a8d87eeb88269f7ad796053"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}