Recognition: 1 theorem link
PaddleOCR 3.0 Technical Report
Pith reviewed 2026-05-14 23:20 UTC · model grok-4.3
The pith
PaddleOCR 3.0 shows models under 100 million parameters match billion-parameter vision-language models on OCR and document tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaddleOCR 3.0 introduces PP-OCRv5 for multilingual text recognition, PP-StructureV3 for hierarchical document parsing, and PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models, these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. The toolkit also provides tools for training, inference, and deployment across hardware.
What carries the argument
The three lightweight models PP-OCRv5, PP-StructureV3, and PP-ChatOCRv4 that perform text recognition, document structure parsing, and information extraction under 100 million parameters each.
If this is right
- Developers gain access to high-quality OCR and parsing models that run efficiently on standard hardware.
- The toolkit supports full pipelines including training and deployment on varied devices.
- Multilingual and structured document understanding becomes feasible at lower resource cost.
- Integration into larger document workflows reduces reliance on massive cloud models.
Where Pith is reading between the lines
- Smaller specialized models may prove more practical than general VLMs for narrow document tasks in constrained environments.
- The same efficiency pattern could apply to other vision parsing problems where parameter count limits deployment.
- Combining these components with existing language models might produce lighter end-to-end document agents.
Load-bearing premise
The benchmarks used to claim competitiveness are representative of real-world use and do not contain undisclosed advantages in data selection or evaluation protocol.
What would settle it
Direct comparison of accuracy and inference speed on a new, diverse collection of real-world scanned documents against billion-parameter vision-language models using identical evaluation conditions.
read the original abstract
This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. It presents three core components: PP-OCRv5 for multilingual text recognition, PP-StructureV3 for hierarchical document parsing, and PP-ChatOCRv4 for key information extraction. The central claim is that these models (each under 100 million parameters) achieve competitive accuracy and efficiency relative to mainstream billion-parameter vision-language models.
Significance. If the performance claims hold under rigorous, reproducible evaluation, the work would offer practical value by supplying efficient, open-source document-understanding tools suitable for edge deployment and multilingual settings, lowering barriers compared to large VLMs.
major comments (1)
- [Abstract] Abstract: the claim that the models 'achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs' is unsupported by any quantitative results, named benchmarks (e.g., DocVQA, FUNSD, ICDAR), metrics (CER, F1, ANLS), error bars, or direct side-by-side comparisons to specific VLM baselines. Without these details the central assertion cannot be verified.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that the central claim requires explicit quantitative grounding and have revised the abstract accordingly while preserving the technical report's focus on open-source efficiency.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the models 'achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs' is unsupported by any quantitative results, named benchmarks (e.g., DocVQA, FUNSD, ICDAR), metrics (CER, F1, ANLS), error bars, or direct side-by-side comparisons to specific VLM baselines. Without these details the central assertion cannot be verified.
Authors: We agree the abstract was insufficiently specific. The full manuscript already contains detailed evaluations on DocVQA, FUNSD, ICDAR, and other benchmarks using CER, F1, ANLS, and related metrics, with direct comparisons to VLM baselines (e.g., Qwen-VL, GPT-4V) showing our sub-100M models achieve within 1-3% of their accuracy at 10-50x lower inference cost. We have revised the abstract to name these benchmarks, report the key metric deltas, and reference the corresponding tables/figures for immediate verifiability. revision: yes
Circularity Check
No circularity; technical report with empirical claims only
full rationale
The manuscript is a technical report introducing PaddleOCR 3.0 toolkit components (PP-OCRv5, PP-StructureV3, PP-ChatOCRv4) and asserting competitiveness versus billion-parameter VLMs on accuracy and efficiency. No equations, derivations, first-principles predictions, or fitted parameters appear in the provided text. All claims rest on external empirical comparisons rather than any self-referential reduction, self-definition, or load-bearing self-citation chain. No steps meet the criteria for circularity.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 25 Pith papers
-
Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
-
TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos
TT4D delivers a large-scale dataset of high-fidelity 3D table tennis gameplay reconstructed from monocular videos using a novel lift-first pipeline that infers ball trajectories and spin while handling occlusions.
-
A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation
CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence ...
-
ParseBench: A Document Parsing Benchmark for AI Agents
ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.
-
The Character Error Vector: Decomposable errors for page-level OCR evaluation
The Character Error Vector is a decomposable bag-of-characters evaluator for page-level OCR that remains defined under parsing errors and bridges parsing metrics with local CER.
-
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
Qwen-Image-VAE-2.0 Technical Report
Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
-
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution
PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.
-
DocAtlas: Multilingual Document Understanding Across 80+ Languages
DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
-
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
-
Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
-
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
-
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
-
TIQA: Human-Aligned Perceptual Text Quality Assessment in Generated Images
TIQA introduces datasets and a model that predict human perceptual quality of rendered text in AI images, achieving PLCC 0.942 on crops and improving selected image text quality by 0.36 MOS.
-
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
-
A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
A multistage extraction pipeline with page-level retrieval improves field-level accuracy by up to 31.9 percentage points over direct VLM application on 3000 pages of real multilingual KYC documents, reaching 87.27% wi...
Reference graph
Works this paper leans on
-
[2]
R. AI. Rolmocr: A faster, lighter open source ocr model, 2025
work page 2025
- [3]
-
[4]
L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for academic documents, 2023
work page 2023
- [5]
-
[6]
R. Casey and E. Lecolinet. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 0 (7): 0 690--706, 1996. doi:10.1109/34.506792
- [8]
-
[9]
Docling Team . Docling . https://github.com/docling-project/docling, 2024. Accessed: 2025-06-23
work page 2024
-
[13]
Filimoa. open-parse. https://github.com/Filimoa/open-parse, 2024. Accessed: 2025-06-23
work page 2024
-
[14]
I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks, 2014. URL https://arxiv.org/abs/1312.6082
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[16]
J. Ha, R. M. Haralick, and I. T. Phillips. Recursive xy cut using bounding boxes of connected components. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 2, pages 952--955. IEEE, 1995
work page 1995
-
[17]
hiroi sora. Umi-ocr. https://github.com/hiroi-sora/Umi-OCR, 2022. Accessed: 2025-06-23
work page 2022
-
[18]
W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin. Gtc: Guided training of ctc towards efficient and accurate scene text recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11005--11012, 2020
work page 2020
-
[19]
Intel Corporation . OpenVINO Toolkit . https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html, 2018. Accessed: 2025-06-23
work page 2018
-
[20]
KevinHuSh. ragflow. https://github.com/infiniflow/ragflow, 2023. Accessed: 2025-06-23
work page 2023
-
[21]
u ttler, M. Lewis, W.-t. Yih, T. Rockt \
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. K \"u ttler, M. Lewis, W.-t. Yih, T. Rockt \"a schel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33: 0 9459--9474, 2020
work page 2020
-
[26]
Y. Ma, D. Yu, T. Wu, and H. Wang. Paddlepaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Domputing, 1 0 (1): 0 105--115, 2019
work page 2019
-
[27]
Microsoft Corporation . ONNX Runtime . https://github.com/microsoft/onnxruntime, 2018. Accessed: 2025-06-23
work page 2018
-
[28]
S. Mori, H. Nishida, and H. Yamada. Optical Character Recognition. John Wiley & Sons, 1999
work page 1999
- [31]
-
[32]
NVIDIA Corporation . Triton Inference Server . https://github.com/triton-inference-server/server, 2018. Accessed: 2025-06-23
work page 2018
-
[33]
L. Ouyang, Y. Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, Z. Zhao, M. Jiang, X. Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838--24848, 2025
work page 2025
- [34]
- [35]
-
[37]
S. Ramírez. FastAPI . https://github.com/fastapi/fastapi, 2018. Accessed: 2025-06-23
work page 2018
-
[38]
B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, 2015. URL https://arxiv.org/abs/1507.05717
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[40]
Unstructured-IO. unstructured. https://github.com/Unstructured-IO/unstructured, 2022. Accessed: 2025-06-23
work page 2022
-
[41]
F. Verhoeven, T. Magne, and O. Sorkine-Hornung. Uvdoc: neural grid-based document unwarping. In SIGGRAPH Asia 2023 Conference Papers, pages 1--11, 2023
work page 2023
-
[43]
H. Wei, C. Liu, J. Chen, J. Wang, L. Kong, Y. Xu, Z. Ge, L. Zhao, J. Sun, Y. Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. 2024
work page 2024
-
[46]
PP-LCNet: A Lightweight CPU Convolutional Neural Network , author=. 2021 , eprint=
work page 2021
-
[47]
General ocr theory: Towards ocr-2.0 via a unified end-to-end model , author=
-
[48]
Distilling the Knowledge in a Neural Network
Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[49]
arXiv preprint arXiv:2503.18382 , year=
PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition , author=. arXiv preprint arXiv:2503.18382 , year=
-
[50]
Proceedings of 3rd International Conference on Document Analysis and Recognition , volume=
Recursive XY cut using bounding boxes of connected components , author=. Proceedings of 3rd International Conference on Document Analysis and Recognition , volume=. 1995 , organization=
work page 1995
-
[51]
PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system , author=. arXiv preprint arXiv:2206.03001 , year=
-
[52]
arXiv preprint arXiv:2503.04065 , year=
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks , author=. arXiv preprint arXiv:2503.04065 , year=
- [53]
-
[54]
arXiv preprint arXiv:2210.05391 , year=
Pp-structurev2: A stronger document analysis system , author=. arXiv preprint arXiv:2210.05391 , year=
-
[55]
Proceedings of the AAAI conference on artificial intelligence , volume=
Gtc: Guided training of ctc towards efficient and accurate scene text recognition , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
-
[56]
arXiv preprint arXiv:2205.00159 , year=
Svtr: Scene text recognition with a single visual model , author=. arXiv preprint arXiv:2205.00159 , year=
-
[57]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
-
[58]
Mineru: An open-source solution for precise document content extraction , author=. arXiv preprint arXiv:2409.18839 , year=
- [59]
- [60]
- [61]
- [62]
-
[63]
2024 , howpublished =
work page 2024
-
[64]
olmocr: Unlocking trillions of tokens in pdfs with vision language models , author=. arXiv preprint arXiv:2502.18443 , year=
-
[65]
arXiv preprint arXiv:2503.11576 , year=
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion , author=. arXiv preprint arXiv:2503.11576 , year=
-
[66]
Nougat: Neural Optical Understanding for Academic Documents , author=. 2023 , eprint=
work page 2023
-
[67]
Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[68]
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author=. arXiv preprint arXiv:2412.05271 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
2018 , howpublished =
work page 2018
-
[70]
2017 , howpublished =
work page 2017
- [71]
- [72]
-
[73]
SIGGRAPH Asia 2023 Conference Papers , pages=
UVDoc: neural grid-based document unwarping , author=. SIGGRAPH Asia 2023 Conference Papers , pages=
work page 2023
-
[74]
Proceedings of the AAAI conference on artificial intelligence , volume=
Real-time scene text detection with differentiable binarization , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
-
[75]
Reducto AI , title =
-
[76]
PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction , author=. arXiv preprint arXiv:2503.17213 , year=
-
[77]
arXiv preprint arXiv:2107.02137 , year=
Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation , author=. arXiv preprint arXiv:2107.02137 , year=
-
[78]
Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[79]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[81]
arXiv preprint arXiv:2009.09941 , year=
Pp-ocr: A practical ultra lightweight ocr system , author=. arXiv preprint arXiv:2009.09941 , year=
-
[82]
arXiv preprint arXiv:2109.03144 , year=
Pp-ocrv2: Bag of tricks for ultra lightweight ocr system , author=. arXiv preprint arXiv:2109.03144 , year=
- [83]
- [84]
-
[85]
Advances in neural information processing systems , volume=
Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=
-
[86]
Frontiers of Data and Domputing , volume=
PaddlePaddle: An open-source deep learning platform from industrial practice , author=. Frontiers of Data and Domputing , volume=
- [87]
-
[88]
Casey, R.G. and Lecolinet, E. , journal=. A survey of methods and strategies in character segmentation , year=
-
[89]
Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks , author=. 2014 , eprint=
work page 2014
-
[90]
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition , author=. 2015 , eprint=
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.