PP-OCR: A Practical Ultra Lightweight OCR System
Optical Character Recognition (OCR) systems are widely used in a variety of application scenarios, such as office automation (OA) systems, factory automation, online education, and map production. However, OCR remains a challenging task because of the wide variety of text appearances and the demand for computational efficiency. In this paper, we propose a practical ultra lightweight OCR system, PP-OCR. The overall model size of PP-OCR is only 3.5M for recognizing 6622 Chinese characters and 2.8M for recognizing 63 alphanumeric symbols. We introduce a bag of strategies to either enhance model ability or reduce model size. The corresponding ablation experiments on real data are also provided. Meanwhile, several pre-trained models for Chinese and English recognition are released, including a text detector (using 97K images), a direction classifier (using 600K images), and a text recognizer (using 17.9M images). In addition, PP-OCR is verified on several other language recognition tasks, including French, Korean, Japanese, and German. All of the above-mentioned models are open-sourced, and the code is available in the GitHub repository https://github.com/PaddlePaddle/PaddleOCR.
Forward citations
Cited by 24 Pith papers
-
Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media
Creates the first benchmark dataset integrating papers, slides, videos, and presentations for evaluating AI models on fine-grained multimodal correspondences in science.
-
Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models
SciDraw-Bench provides 32 structured tasks and a four-dimensional protocol to evaluate text-to-image models on scientific figure generation, with a domain-specific system outperforming general baselines in a pilot.
-
StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting
StyleText is a new large-scale dataset and benchmark for stylized scene text inpainting, constructed via an automated pipeline and paired with a FluxFill+LoRA baseline that improves OCR accuracy.
-
Edit Fidelity Field: Semantics-Aware Region Isolation for Training-Free Scene Text Editing
Edit Fidelity Field reduces edit spillover in scene text editing from 94% to 25% by constructing a four-zone semantics-aware fidelity field from OCR detections.
-
HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction
HalalBench is the first open multilingual OCR benchmark for food packaging, with 1,043 images across 14 languages showing that current engines achieve low F1 scores around 0.17-0.19.
-
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
-
ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering
ViTexQA is a dataset forcing multi-frame text fusion for all questions, with FrameThinker achieving 6.3% ROUGE-L gain over baselines via CoT SFT and temporally-grounded RL.
-
TextFake: Benchmarking AI-Generated Image Detection on Text-Rich Images
TextFake benchmark shows no AI-generated image detector exceeds 80% accuracy on text-rich images and identifies three failure modes including text density and rendering fidelity issues.
-
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
TRACE improves multi-video event understanding by grounding evidence in structured timelines before visual reasoning, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR 2026.
-
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
TRACE builds structured text timelines from videos via OCR and detection, then applies text-only LLM evidence localization before LVLM claim generation, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR.
-
Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning
A self-prompting MM-DiT model performs open-vocabulary scene text editing by extracting style and glyph information from the original image without extra encoders.
-
Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning
Self-prompting diffusion transformer uses in-context learning on self-generated prompts from the image to achieve open-vocabulary scene text editing with style consistency.
-
Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media
Creates MCD, the first benchmark dataset integrating papers, slides, videos and presentations, then evaluates embedding and vision-language models on discovering fine-grained alignments across them.
-
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
-
DocRevive: A Unified Pipeline for Document Text Restoration
A unified pipeline using OCR, inpainting, and diffusion models restores text in degraded documents on a new synthetic benchmark dataset, evaluated with the proposed UCSM metric.
-
DocRevive: A Unified Pipeline for Document Text Restoration
DocRevive builds a unified pipeline using OCR, image analysis, language models, and diffusion to reconstruct degraded document text, backed by a 30k-image synthetic dataset and the UCSM metric.
-
GUI-AC: Enhancing Continual Learning in GUI Agents
GUI-AC stabilizes RFT for non-stationary GUI data by down-weighting noisy advantages and relaxing clipping bounds via a grounding certainty term.
-
Handwriting Extraction and Analysis of Signature Lists in Swiss Popular Initiatives
Evaluation on 443 entries from 418 writers finds writer retrieval (mAP 50.6%) more robust than OCR (CER 29.6%) for identifying similar handwriting, concluding OCR is unreliable for short out-of-vocabulary names while ...
-
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
Unveil proposes a visual-textual embedding model for multi-modal documents that is distilled into an efficient visual-only retriever.
-
INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public Safety
INSIGHT transfers 2D semantic understanding from foundation models and traditional CV tools into 3D point clouds and compressed scene graphs for indoor public-safety mapping without target-domain labels.
-
PaddleOCR 3.0 Technical Report
PaddleOCR 3.0 releases compact open-source models for OCR, document structure parsing, and information extraction that rival billion-parameter VLMs.
-
Step1X-Edit: A Practical Framework for General Image Editing
Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models o...
-
PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks
PP-OCRv6 introduces three tiers of lightweight OCR models (1.5M–34.5M parameters) built on unified MetaFormer blocks with reparameterization that claim superior accuracy to PP-OCRv5 and billion-scale VLMs on in-house ...
-
Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
Describes a microservice architecture for production document AI pipelines with OCR and LLMs, reporting that OCR dominates latency and GPU inference capacity limits concurrency.