DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.
arXiv preprint arXiv:2509.01215 , year=
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
PaddleOCR-VL-1.5 is a 0.9B VLM achieving 94.5% SOTA accuracy on OmniDocBench v1.5, with added robustness to physical distortions and support for seal recognition plus text spotting.
citing papers explorer
-
DocAtlas: Multilingual Document Understanding Across 80+ Languages
DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.
-
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
-
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
-
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
-
PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
PaddleOCR-VL-1.5 is a 0.9B VLM achieving 94.5% SOTA accuracy on OmniDocBench v1.5, with added robustness to physical distortions and support for seal recognition plus text spotting.