PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System

Baohua Lai; Chenxia Li; Dianhai Yu; Kaitao Jiang; Lingfeng Zhu; Ruoyu Guo; Weiwei Liu; Xiaoguang Hu; Xiaoting Yin; Yanjun Ma

arxiv: 2206.03001 · v2 · pith:O6SMI4FQnew · submitted 2022-06-07 · 💻 cs.CV

Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, Yanjun Ma

classification 💻 cs.CV

keywords model, text, pp-ocrv2, pp-ocrv3, system, lightweight, recognition, attention

Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering both efficiency and accuracy, we proposed a practical ultra lightweight OCR system (PP-OCR) and an optimized version, PP-OCRv2. To further improve on PP-OCRv2, this paper proposes a more robust OCR system, PP-OCRv3. PP-OCRv3 upgrades the text detection and text recognition models of PP-OCRv2 in 9 aspects. For the text detector, we introduce a PAN module with a large receptive field named LK-PAN, an FPN module with a residual attention mechanism named RSE-FPN, and the DML distillation strategy. For the text recognizer, the base model is changed from CRNN to SVTR, and we introduce the lightweight text recognition network SVTR_LCNet, guided training of CTC by attention, the data augmentation strategy TextConAug, a better pre-trained model obtained with the self-supervised TextRotNet, UDML, and UIM to accelerate the model and improve results. Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than that of PP-OCRv2 at comparable inference speed. All of the above models are open-sourced, and the code is available in the GitHub repository PaddleOCR, which is powered by PaddlePaddle.
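The abstract reports its headline result as hmean. In OCR evaluation this conventionally means the harmonic mean of precision and recall (the F1 score). The sketch below illustrates that definition only. The function names and the example counts are hypothetical, and the matching rule that decides which predictions count as correct is not given in the abstract, so it is taken here as an input.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (equivalent to the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hmean_from_counts(num_correct: int, num_predicted: int, num_ground_truth: int) -> float:
    """Compute hmean from raw counts.

    num_correct: predictions judged correct under some matching rule
                 (the rule itself is an assumption left to the caller).
    num_predicted: total number of predictions.
    num_ground_truth: total number of annotated text instances.
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    return hmean(precision, recall)


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    print(hmean_from_counts(num_correct=80, num_predicted=100, num_ground_truth=120))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot score a high hmean by trading away recall for precision or the reverse.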

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities
cs.CV 2026-04 accept novelty 7.0

BMD-45 is a new large-scale CCTV vehicle detection dataset from developing cities that reveals a 2.5x performance gap for models adapted from prior benchmarks.

MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
cs.CL 2026-04 unverdicted novelty 7.0

MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other metho...

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents
cs.CR 2026-06 unverdicted novelty 6.0

CAPED reduces incidental visual privacy leakage in mobile GUI agents from 0.766 to 0.268 on seeded AndroidWorld tasks by selectively exposing only task-relevant screen content.

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution
cs.CV 2026-05 unverdicted novelty 6.0

ASASR recasts generative super-resolution flow into Sobolev Riemannian geometry via spectrally colored noise kernels and parametric adversaries from the Riesz Representation Theorem to enforce structural fidelity.

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
cs.CV 2026-05 unverdicted novelty 6.0

StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a ...

TextDS: Parameter-Efficient Representation Alignment for Scene Text Detection under Distribution Shifts
cs.CV 2026-06 unverdicted novelty 5.0

TextDS uses a data-efficient dual-encoder with SWLoRA and CSF to achieve competitive scene text detection robustness under distribution shifts and adverse conditions using 4.9M trainable parameters.

TextDS: Parameter-Efficient Representation Alignment for Scene Text Detection under Distribution Shifts
cs.CV 2026-06 unverdicted novelty 5.0

TextDS aligns dual visual encoders via SWLoRA and CSF for robust scene text detection under shifts, using 4.9M parameters and new adverse-condition datasets.

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation
cs.CV 2026-05 unverdicted novelty 5.0

VaaWIT proposes DSAM and VAA modules to adapt LLMs for multilingual web image translation, claiming outperformance over open-source baselines on benchmarks.

CogVLM2: Visual Language Models for Image and Video Understanding
cs.CV 2024-08 conditional novelty 5.0

CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution
cs.CV 2026-05 unverdicted novelty 4.0

ASASR recasts generative SR flow into Sobolev Riemannian geometry via colored noise kernels and a Riesz-based parametric adversary to optimize along plausible structural failure tangents, claiming better spectral cons...

A Proactive EMR Assistant for Doctor-Patient Dialogue: Streaming ASR, Belief Stabilization, and Preliminary Controlled Evaluation
cs.CL 2026-03 conditional novelty 4.0

A proactive EMR assistant using streaming ASR and belief stabilization reaches 0.84 state-event F1, 0.87 retrieval Recall@5, and 83.3% coverage in a controlled pilot of ten doctor-patient dialogues.

PaddleOCR 3.0 Technical Report
cs.CV 2025-07 unverdicted novelty 4.0

PaddleOCR 3.0 releases compact open-source models for OCR, document structure parsing, and information extraction that rival billion-parameter VLMs.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks
cs.CV 2026-06 unverdicted novelty 3.0

PP-OCRv6 introduces three tiers of lightweight OCR models (1.5M–34.5M parameters) built on unified MetaFormer blocks with reparameterization that claim superior accuracy to PP-OCRv5 and billion-scale VLMs on in-house ...