{"paper":{"title":"MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MinerU2.5 decouples global layout analysis on downsampled images from local content recognition on native-resolution crops to parse high-resolution documents with state-of-the-art accuracy and lower compute.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bin Wang, Bowen Zhou, Boyu Niu, Bo Zhang, Chao Xu, Conghui He, Dahua Lin, Dechen Lin, Dongsheng Ma, Fangdong Wang, Fan Wu, Guang Liang, Guangyu Wang, Guanlin Shen, Hejun Dong, Huaiyu Gu, Jiang Wu, Jiaqi Wang, Jingzhou Chen, Junbo Niu, Junyuan Zhang, Kai Chen, Keming Wang, Lei Bai, Lijun Wu, Lindong Lu, Linfeng Zhang, Linke Ouyang, Liqun Wei, Lu Chen, Pei Chu, Qianqian Wu, Qintong Zhang, Ruiliang Xu, Rui Zhang, Shasha Wang, Siyi Qian, Tao Chu, Tianyao He, Weijia Li, Wei Li, Wentao Zhang, Wenzheng Zhang, Xiaomeng Zhao, Xiaoyi Dong, Xuanhe Zhou, Yuanhong Zheng, Yuan Qu, Yuanyuan Cao, Yuefeng Sun, Yuhang Zang, Yu Qiao, Zheng Liu, Zhenjiang Jin, Zhenxiang Li, Zhifei Ren, Zhiyuan Zhao, Zhongying Tu, Zhuangcheng Gu, Zirui Tang, Ziyang Miao","submitted_at":"2025-09-26T10:45:48Z","abstract_excerpt":"We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted conten"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That coarse layout analysis performed on downsampled images provides sufficiently accurate guidance for extracting and recognizing native-resolution crops without introducing errors in dense text, complex formulas, or table structures.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MinerU2.5 decouples global layout analysis on downsampled images from local content recognition on native-resolution crops to parse high-resolution documents with state-of-the-art accuracy and lower compute.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"547612727f1c08aa60644d38ff097f0e71b8f711553a944b32fb41efcf0d768b"},"source":{"id":"2509.22186","kind":"arxiv","version":2},"verdict":{"id":"ac65dee8-3b0d-4c15-b1e3-8ae183c0e188","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T13:20:11.638576Z","strongest_claim":"MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.","one_line_summary":"MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That coarse layout analysis performed on downsampled images provides sufficiently accurate guidance for extracting and recognizing native-resolution crops without introducing errors in dense text, complex formulas, or table structures.","pith_extraction_headline":"MinerU2.5 decouples global layout analysis on downsampled images from local content recognition on native-resolution crops to parse high-resolution documents with state-of-the-art accuracy and lower compute."},"references":{"count":63,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2022,"title":"Wukong-reader: Multi-modal pre-training for fine-grained visual document understanding.arXiv preprint arXiv:2212.09621, 2022","work_id":"5ab3d15a-640e-4244-810e-cc619f52dad4","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":3,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":2023,"title":"Nougat: Neural Optical Understanding for Academic Documents","work_id":"26c3b627-7e97-40d7-bab3-020936b8196b","ref_index":4,"cited_arxiv_id":"2308.13418","is_internal_anchor":true},{"doi":"","year":2025,"title":"chatdoc com. Ocrflux.https://github.com/chatdoc-com/OCRFlux, 2025. Accessed:2025-09-25","work_id":"c9d28b27-542c-4250-990f-746ad3563e7c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":63,"snapshot_sha256":"a041b7580e37103bbc78e81817e5f7e9a6df11758cb78ef7e14efa4eaa020016","internal_anchors":17},"formal_canon":{"evidence_count":2,"snapshot_sha256":"16208874e119a71a327125a432000c5f3a49fae71d98265506d02ce75c3ee3cc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}