{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:VPF5YSOVDWVNSVG4S7ZGZN6YRS","short_pith_number":"pith:VPF5YSOV","schema_version":"1.0","canonical_sha256":"abcbdc49d51daad954dc97f26cb7d88cb26f4e9f5be16636c71bdb3d83838c56","source":{"kind":"arxiv","id":"2409.01704","version":1},"attestation_state":"computed","paper":{"title":"General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chenglong Liu, Chunrui Han, Haoran Wei, Jianjian Sun, Jia Wang, Jinyue Chen, Liang Zhao, Lingyu Kong, Xiangyu Zhang, Yanming Xu, Yuang Peng, Zheng Ge","submitted_at":"2024-09-03T08:41:31Z","abstract_excerpt":"Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as \"characters\" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-conte"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2409.01704","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-sa/4.0/","primary_cat":"cs.CV","submitted_at":"2024-09-03T08:41:31Z","cross_cats_sorted":[],"title_canon_sha256":"3f2bbec1951d819bd39a2296e8c2b4200d4a4fea581a6aefc7dfd61787c8bda4","abstract_canon_sha256":"42b6d6c11459ec2495155196491af0a58729ade575db81501be3ff3c46cd15c5"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.115483Z","signature_b64":"Or0+6wbbW8Y0hPAq5eYbMr8HYvQXq/n5g7cG7BOtHJQ/DdTMolLp/MOxAwA4QMZXth0KnZ7Af2DuHuS1u5qzCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"abcbdc49d51daad954dc97f26cb7d88cb26f4e9f5be16636c71bdb3d83838c56","last_reissued_at":"2026-05-17T23:38:13.114827Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.114827Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chenglong Liu, Chunrui Han, Haoran Wei, Jianjian Sun, Jia Wang, Jinyue Chen, Liang Zhao, Lingyu Kong, Xiangyu Zhang, Yanming Xu, Yuang Peng, Zheng Ge","submitted_at":"2024-09-03T08:41:31Z","abstract_excerpt":"Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as \"characters\" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-conte"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as 'characters' and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT ... can handle all the above 'characters' under various OCR tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a single 580M-parameter end-to-end model with prompt-based output formatting can maintain high accuracy across all listed character types and input styles without requiring task-specific components or suffering from interference between them.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"14e91852765b7f7f7589a219f44fce9aa8e0a87d822d1a1bb71d52661b85a111"},"source":{"id":"2409.01704","kind":"arxiv","version":1},"verdict":{"id":"ee74def5-05bf-4541-8ba8-fe34908683e9","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:46:27.603037Z","strongest_claim":"we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as 'characters' and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT ... can handle all the above 'characters' under various OCR tasks.","one_line_summary":"GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a single 580M-parameter end-to-end model with prompt-based output formatting can maintain high accuracy across all listed character types and input styles without requiring task-specific components or suffering from interference between them.","pith_extraction_headline":"A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters."},"references":{"count":55,"sample":[{"doi":"","year":2024,"title":"https://huggingface.co/datasets/Teklia/CASIA-HWDB2-line (2024) 6","work_id":"efa2f0aa-94bb-4f8a-be9a-ef30d147d703","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"https://huggingface.co/datasets/Teklia/IAM-line (2024) 6","work_id":"bade56df-693e-494c-bd89-644ff64c339d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"https://huggingface.co/datasets/Teklia/NorHand-v3-line (2024) 6","work_id":"e47f2fad-800e-48cd-a291-6eaddb1fe6fd","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","ref_index":4,"cited_arxiv_id":"2309.16609","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":5,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":55,"snapshot_sha256":"327eaeb634c21159b954b31fe4a7a3805485d28f72c26f0c4584d75a4bffecbb","internal_anchors":10},"formal_canon":{"evidence_count":2,"snapshot_sha256":"95a8ba08337e34da427262fdeb3c674da7b763217a8d87eeb88269f7ad796053"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2409.01704","created_at":"2026-05-17T23:38:13.114936+00:00"},{"alias_kind":"arxiv_version","alias_value":"2409.01704v1","created_at":"2026-05-17T23:38:13.114936+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2409.01704","created_at":"2026-05-17T23:38:13.114936+00:00"},{"alias_kind":"pith_short_12","alias_value":"VPF5YSOVDWVN","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"VPF5YSOVDWVNSVG4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"VPF5YSOV","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":17,"internal_anchor_count":17,"sample":[{"citing_arxiv_id":"2501.00321","citing_title":"OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2511.14998","citing_title":"FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22186","citing_title":"MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2601.04068","citing_title":"Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2601.09298","citing_title":"Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01785","citing_title":"CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2409.18839","citing_title":"MinerU: An Open-Source Solution for Precise Document Content Extraction","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2603.23885","citing_title":"Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2603.24326","citing_title":"Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12623","citing_title":"DocAtlas: Multilingual Document Understanding Across 80+ Languages","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.00270","citing_title":"OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02880","citing_title":"InstructTable: Improving Table Structure Recognition Through Instructions","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09343","citing_title":"SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18234","citing_title":"DeepSeek-OCR: Contexts Optical Compression","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2502.16982","citing_title":"Muon is Scalable for LLM Training","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04771","citing_title":"MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16070","citing_title":"TableSeq: Unified Generation of Structure, Content, and Layout","ref_index":54,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS","json":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS.json","graph_json":"https://pith.science/api/pith-number/VPF5YSOVDWVNSVG4S7ZGZN6YRS/graph.json","events_json":"https://pith.science/api/pith-number/VPF5YSOVDWVNSVG4S7ZGZN6YRS/events.json","paper":"https://pith.science/paper/VPF5YSOV"},"agent_actions":{"view_html":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS","download_json":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS.json","view_paper":"https://pith.science/paper/VPF5YSOV","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2409.01704&json=true","fetch_graph":"https://pith.science/api/pith-number/VPF5YSOVDWVNSVG4S7ZGZN6YRS/graph.json","fetch_events":"https://pith.science/api/pith-number/VPF5YSOVDWVNSVG4S7ZGZN6YRS/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS/action/timestamp_anchor","attest_storage":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS/action/storage_attestation","attest_author":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS/action/author_attestation","sign_citation":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS/action/citation_signature","submit_replication":"https://pith.science/pith/VPF5YSOVDWVNSVG4S7ZGZN6YRS/action/replication_record"}},"created_at":"2026-05-17T23:38:13.114936+00:00","updated_at":"2026-05-17T23:38:13.114936+00:00"}