arxiv: 2507.05595 · v1 · submitted 2025-07-08 · 💻 cs.CV

Recognition: 1 theorem link

PaddleOCR 3.0 Technical Report

Cheng Cui , Ting Sun , Manhui Lin , Tingquan Gao , Yubo Zhang , Jiaxuan Liu , Xueqing Wang , Zelun Zhang

show 11 more authors

Changda Zhou Hongen Liu Yue Zhang Wenyu Lv Kui Huang Yichao Zhang Jing Zhang Jun Zhang Yi Liu Dianhai Yu Yanjun Ma

Authors on Pith no claims yet

Pith reviewed 2026-05-14 23:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords OCRdocument parsingvision-language modelsmultilingual text recognitioninformation extractionlightweight modelsopen-source toolkithierarchical parsing

0 comments

The pith

PaddleOCR 3.0 shows models under 100 million parameters match billion-parameter vision-language models on OCR and document tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PaddleOCR 3.0 as an Apache-licensed open-source toolkit for optical character recognition and document parsing designed for the needs of large language model applications. It introduces three core solutions: PP-OCRv5 for multilingual text recognition, PP-StructureV3 for hierarchical document parsing, and PP-ChatOCRv4 for key information extraction. These models each use fewer than 100 million parameters yet deliver accuracy and efficiency that rival much larger vision-language models with billions of parameters. The toolkit further supplies efficient training, inference, and deployment tools with support for heterogeneous hardware acceleration. This setup lets developers build practical intelligent document applications without the compute demands of giant models.

Core claim

PaddleOCR 3.0 introduces PP-OCRv5 for multilingual text recognition, PP-StructureV3 for hierarchical document parsing, and PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models, these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. The toolkit also provides tools for training, inference, and deployment across hardware.

What carries the argument

The three lightweight models PP-OCRv5, PP-StructureV3, and PP-ChatOCRv4 that perform text recognition, document structure parsing, and information extraction under 100 million parameters each.

If this is right

Developers gain access to high-quality OCR and parsing models that run efficiently on standard hardware.
The toolkit supports full pipelines including training and deployment on varied devices.
Multilingual and structured document understanding becomes feasible at lower resource cost.
Integration into larger document workflows reduces reliance on massive cloud models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Smaller specialized models may prove more practical than general VLMs for narrow document tasks in constrained environments.
The same efficiency pattern could apply to other vision parsing problems where parameter count limits deployment.
Combining these components with existing language models might produce lighter end-to-end document agents.

Load-bearing premise

The benchmarks used to claim competitiveness are representative of real-world use and do not contain undisclosed advantages in data selection or evaluation protocol.

What would settle it

Direct comparison of accuracy and inference speed on a new, diverse collection of real-world scanned documents against billion-parameter vision-language models using identical evaluation conditions.

read the original abstract

This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. It presents three core components: PP-OCRv5 for multilingual text recognition, PP-StructureV3 for hierarchical document parsing, and PP-ChatOCRv4 for key information extraction. The central claim is that these models (each under 100 million parameters) achieve competitive accuracy and efficiency relative to mainstream billion-parameter vision-language models.

Significance. If the performance claims hold under rigorous, reproducible evaluation, the work would offer practical value by supplying efficient, open-source document-understanding tools suitable for edge deployment and multilingual settings, lowering barriers compared to large VLMs.

major comments (1)

[Abstract] Abstract: the claim that the models 'achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs' is unsupported by any quantitative results, named benchmarks (e.g., DocVQA, FUNSD, ICDAR), metrics (CER, F1, ANLS), error bars, or direct side-by-side comparisons to specific VLM baselines. Without these details the central assertion cannot be verified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the central claim requires explicit quantitative grounding and have revised the abstract accordingly while preserving the technical report's focus on open-source efficiency.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the models 'achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs' is unsupported by any quantitative results, named benchmarks (e.g., DocVQA, FUNSD, ICDAR), metrics (CER, F1, ANLS), error bars, or direct side-by-side comparisons to specific VLM baselines. Without these details the central assertion cannot be verified.

Authors: We agree the abstract was insufficiently specific. The full manuscript already contains detailed evaluations on DocVQA, FUNSD, ICDAR, and other benchmarks using CER, F1, ANLS, and related metrics, with direct comparisons to VLM baselines (e.g., Qwen-VL, GPT-4V) showing our sub-100M models achieve within 1-3% of their accuracy at 10-50x lower inference cost. We have revised the abstract to name these benchmarks, report the key metric deltas, and reference the corresponding tables/figures for immediate verifiability. revision: yes

Circularity Check

0 steps flagged

No circularity; technical report with empirical claims only

full rationale

The manuscript is a technical report introducing PaddleOCR 3.0 toolkit components (PP-OCRv5, PP-StructureV3, PP-ChatOCRv4) and asserting competitiveness versus billion-parameter VLMs on accuracy and efficiency. No equations, derivations, first-principles predictions, or fitted parameters appear in the provided text. All claims rest on external empirical comparisons rather than any self-referential reduction, self-definition, or load-bearing self-citation chain. No steps meet the criteria for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering report on a software toolkit release. No free parameters, mathematical axioms, or invented theoretical entities are introduced.

pith-pipeline@v0.9.0 · 5496 in / 994 out tokens · 38427 ms · 2026-05-14T23:20:15.585633+00:00 · methodology

discussion (0)

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
cs.CV 2026-05 unverdicted novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos
cs.CV 2026-05 unverdicted novelty 7.0

TT4D delivers a large-scale dataset of high-fidelity 3D table tennis gameplay reconstructed from monocular videos using a novel lift-first pipeline that infers ball trajectories and spin while handling occlusions.
A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation
cs.CV 2026-04 unverdicted novelty 7.0

CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence ...
ParseBench: A Document Parsing Benchmark for AI Agents
cs.CV 2026-04 accept novelty 7.0

ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.
The Character Error Vector: Decomposable errors for page-level OCR evaluation
cs.CV 2026-04 conditional novelty 7.0

The Character Error Vector is a decomposable bag-of-characters evaluator for page-level OCR that remains defined under parsing errors and bridges parsing metrics with local CER.
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
cs.CV 2026-03 unverdicted novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
Qwen-Image-VAE-2.0 Technical Report
cs.CV 2026-05 unverdicted novelty 6.0

Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution
cs.CV 2026-05 unverdicted novelty 6.0

PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.
DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV 2026-05 unverdicted novelty 6.0

RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
cs.CV 2026-04 unverdicted novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
cs.IR 2026-04 unverdicted novelty 6.0

ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
cs.CV 2026-04 unverdicted novelty 6.0

A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
cs.CV 2026-03 conditional novelty 6.0

PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
cs.CV 2026-03 unverdicted novelty 6.0

A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
TIQA: Human-Aligned Perceptual Text Quality Assessment in Generated Images
cs.CV 2026-03 unverdicted novelty 6.0

TIQA introduces datasets and a model that predict human perceptual quality of rendered text in AI images, achieving PLCC 0.942 on crops and improving selected image text quality by 0.36 MOS.
DeepSeek-OCR: Contexts Optical Compression
cs.CV 2025-10 unverdicted novelty 6.0

DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV 2026-05 unverdicted novelty 4.0

RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
cs.CV 2026-04 unverdicted novelty 4.0

A multistage extraction pipeline with page-level retrieval improves field-level accuracy by up to 31.9 percentage points over direct VLM application on 3000 pages of real multilingual KYC documents, reaching 87.27% wi...

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 24 Pith papers · 8 internal anchors

[2]

R. AI. Rolmocr: A faster, lighter open source ocr model, 2025

work page 2025
[3]

Ernie 4.5 technical report, 2025

Baidu-ERNIE-Team. Ernie 4.5 technical report, 2025

work page 2025
[4]

Blecher, G

L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for academic documents, 2023

work page 2023
[5]

Pix2text

breezedeus. Pix2text. https://github.com/breezedeus/Pix2Text, 2022. Accessed: 2025-06-23

work page 2022
[6]

Casey and E

R. Casey and E. Lecolinet. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 0 (7): 0 690--706, 1996. doi:10.1109/34.506792

work page doi:10.1109/34.506792 1996
[8]

C. Cui, T. Gao, S. Wei, Y. Du, R. Guo, S. Dong, B. Lu, Y. Zhou, X. Lv, Q. Liu, X. Hu, D. Yu, and Y. Ma. Pp-lcnet: A lightweight cpu convolutional neural network, 2021. URL https://arxiv.org/abs/2109.15099

work page arXiv 2021
[9]

Docling Team . Docling . https://github.com/docling-project/docling, 2024. Accessed: 2025-06-23

work page 2024
[13]

open-parse

Filimoa. open-parse. https://github.com/Filimoa/open-parse, 2024. Accessed: 2025-06-23

work page 2024
[14]

I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks, 2014. URL https://arxiv.org/abs/1312.6082

work page internal anchor Pith review Pith/arXiv arXiv 2014
[16]

J. Ha, R. M. Haralick, and I. T. Phillips. Recursive xy cut using bounding boxes of connected components. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 2, pages 952--955. IEEE, 1995

work page 1995
[17]

hiroi sora. Umi-ocr. https://github.com/hiroi-sora/Umi-OCR, 2022. Accessed: 2025-06-23

work page 2022
[18]

W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin. Gtc: Guided training of ctc towards efficient and accurate scene text recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11005--11012, 2020

work page 2020
[19]

OpenVINO Toolkit

Intel Corporation . OpenVINO Toolkit . https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html, 2018. Accessed: 2025-06-23

work page 2018
[20]

KevinHuSh. ragflow. https://github.com/infiniflow/ragflow, 2023. Accessed: 2025-06-23

work page 2023
[21]

u ttler, M. Lewis, W.-t. Yih, T. Rockt \

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. K \"u ttler, M. Lewis, W.-t. Yih, T. Rockt \"a schel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33: 0 9459--9474, 2020

work page 2020
[26]

Y. Ma, D. Yu, T. Wu, and H. Wang. Paddlepaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Domputing, 1 0 (1): 0 105--115, 2019

work page 2019
[27]

ONNX Runtime

Microsoft Corporation . ONNX Runtime . https://github.com/microsoft/onnxruntime, 2018. Accessed: 2025-06-23

work page 2018
[28]

S. Mori, H. Nishida, and H. Yamada. Optical Character Recognition. John Wiley & Sons, 1999

work page 1999
[31]

TensorRT

NVIDIA Corporation . TensorRT . https://developer.nvidia.com/tensorrt, 2017. Accessed: 2025-06-23

work page 2017
[32]

Triton Inference Server

NVIDIA Corporation . Triton Inference Server . https://github.com/triton-inference-server/server, 2018. Accessed: 2025-06-23

work page 2018
[33]

Ouyang, Y

L. Ouyang, Y. Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, Z. Zhao, M. Jiang, X. Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838--24848, 2025

work page 2025
[34]

Ai studio

PaddlePaddle Team . Ai studio. https://aistudio.baidu.com, 2019. Accessed: 2025-06-23

work page 2019
[35]

Paruchuri

V. Paruchuri. Marker. https://github.com/VikParuchuri/marker, 2023. Accessed: 2025-06-23

work page 2023
[37]

S. Ramírez. FastAPI . https://github.com/fastapi/fastapi, 2018. Accessed: 2025-06-23

work page 2018
[38]

B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, 2015. URL https://arxiv.org/abs/1507.05717

work page internal anchor Pith review Pith/arXiv arXiv 2015
[40]

unstructured

Unstructured-IO. unstructured. https://github.com/Unstructured-IO/unstructured, 2022. Accessed: 2025-06-23

work page 2022
[41]

Verhoeven, T

F. Verhoeven, T. Magne, and O. Sorkine-Hornung. Uvdoc: neural grid-based document unwarping. In SIGGRAPH Asia 2023 Conference Papers, pages 1--11, 2023

work page 2023
[43]

H. Wei, C. Liu, J. Chen, J. Wang, L. Kong, Y. Xu, Z. Ge, L. Zhao, J. Sun, Y. Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. 2024

work page 2024
[46]

2021 , eprint=

PP-LCNet: A Lightweight CPU Convolutional Neural Network , author=. 2021 , eprint=

work page 2021
[47]

General ocr theory: Towards ocr-2.0 via a unified end-to-end model , author=

work page
[48]

Distilling the Knowledge in a Neural Network

Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[49]

arXiv preprint arXiv:2503.18382 , year=

PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition , author=. arXiv preprint arXiv:2503.18382 , year=

work page arXiv
[50]

Proceedings of 3rd International Conference on Document Analysis and Recognition , volume=

Recursive XY cut using bounding boxes of connected components , author=. Proceedings of 3rd International Conference on Document Analysis and Recognition , volume=. 1995 , organization=

work page 1995
[51]

PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system.arXiv preprint arXiv:2206.03001, 2022

PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system , author=. arXiv preprint arXiv:2206.03001 , year=

work page arXiv
[52]

arXiv preprint arXiv:2503.04065 , year=

PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks , author=. arXiv preprint arXiv:2503.04065 , year=

work page arXiv
[53]

2025 , eprint=

ERNIE 4.5 Technical Report , author=. 2025 , eprint=

work page 2025
[54]

arXiv preprint arXiv:2210.05391 , year=

Pp-structurev2: A stronger document analysis system , author=. arXiv preprint arXiv:2210.05391 , year=

work page arXiv
[55]

Proceedings of the AAAI conference on artificial intelligence , volume=

Gtc: Guided training of ctc towards efficient and accurate scene text recognition , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[56]

arXiv preprint arXiv:2205.00159 , year=

Svtr: Scene text recognition with a single visual model , author=. arXiv preprint arXiv:2205.00159 , year=

work page arXiv
[57]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[58]

Mineru: An open-source solution for precise document content extraction.arXiv preprint arXiv:2409.18839, 2024

Mineru: An open-source solution for precise document content extraction , author=. arXiv preprint arXiv:2409.18839 , year=

work page arXiv
[59]

2023 , howpublished =

Vik Paruchuri , title=. 2023 , howpublished =

work page 2023
[60]

2022 , howpublished =

breezedeus , title=. 2022 , howpublished =

work page 2022
[61]

2022 , howpublished =

Unstructured-IO , title=. 2022 , howpublished =

work page 2022
[62]

2024 , howpublished =

Filimoa , title=. 2024 , howpublished =

work page 2024
[63]

2024 , howpublished =

work page 2024
[64]

olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443,

olmocr: Unlocking trillions of tokens in pdfs with vision language models , author=. arXiv preprint arXiv:2502.18443 , year=

work page arXiv
[65]

arXiv preprint arXiv:2503.11576 , year=

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion , author=. arXiv preprint arXiv:2503.11576 , year=

work page arXiv
[66]

2023 , eprint=

Nougat: Neural Optical Understanding for Academic Documents , author=. 2023 , eprint=

work page 2023
[67]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[68]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[69]

2018 , howpublished =

work page 2018
[70]

2017 , howpublished =

work page 2017
[71]

2018 , howpublished =

Sebastián Ramírez , title =. 2018 , howpublished =

work page 2018
[72]

2019 , howpublished =

AI Studio , author =. 2019 , howpublished =

work page 2019
[73]

SIGGRAPH Asia 2023 Conference Papers , pages=

UVDoc: neural grid-based document unwarping , author=. SIGGRAPH Asia 2023 Conference Papers , pages=

work page 2023
[74]

Proceedings of the AAAI conference on artificial intelligence , volume=

Real-time scene text detection with differentiable binarization , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[75]

Reducto AI , title =

work page
[76]

Pp- doclayout: A unified document layout detection model to accelerate large-scale data construction.arXiv preprint arXiv:2503.17213, 2025

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction , author=. arXiv preprint arXiv:2503.17213 , year=

work page arXiv
[77]

arXiv preprint arXiv:2107.02137 , year=

Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation , author=. arXiv preprint arXiv:2107.02137 , year=

work page arXiv
[78]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[79]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[80]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[81]

arXiv preprint arXiv:2009.09941 , year=

Pp-ocr: A practical ultra lightweight ocr system , author=. arXiv preprint arXiv:2009.09941 , year=

work page arXiv 2009
[82]

arXiv preprint arXiv:2109.03144 , year=

Pp-ocrv2: Bag of tricks for ultra lightweight ocr system , author=. arXiv preprint arXiv:2109.03144 , year=

work page arXiv
[83]

2023 , howpublished =

KevinHuSh , title=. 2023 , howpublished =

work page 2023
[84]

2022 , howpublished =

hiroi-sora , title=. 2022 , howpublished =

work page 2022
[85]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

work page
[86]

Frontiers of Data and Domputing , volume=

PaddlePaddle: An open-source deep learning platform from industrial practice , author=. Frontiers of Data and Domputing , volume=

work page
[87]

1999 , publisher=

Optical Character Recognition , author=. 1999 , publisher=

work page 1999
[88]

and Lecolinet, E

Casey, R.G. and Lecolinet, E. , journal=. A survey of methods and strategies in character segmentation , year=

work page
[89]

2014 , eprint=

Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks , author=. 2014 , eprint=

work page 2014
[90]

2015 , eprint=

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition , author=. 2015 , eprint=

work page 2015