PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

Changda Zhou; Cheng Cui; Dianhai Yu; Hongen Liu; Jiaxuan Liu; Manhui Lin; Penglongyi Deng; Suyin Liang; Tingquan Gao; Ting Sun

arxiv: 2606.13108 · v1 · pith:IXF3K2KHnew · submitted 2026-06-11 · 💻 cs.CV

Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng, Ting Sun, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Changda Zhou, Hongen Liu, Suyin Liang, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

Pith reviewed 2026-06-27 07:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords: OCR, text detection, text recognition, lightweight neural networks, MetaFormer, structural reparameterization, vision language models

The pith

PP-OCRv6 redesigns OCR architecture with a unified MetaFormer block to outperform large VLMs using far fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces PP-OCRv6, a lightweight OCR system built around a single MetaFormer-style building block that handles both detection and recognition. The models range from 1.5 million to 34.5 million parameters and are shown to exceed the performance of previous PP-OCR versions and even billion-scale vision-language models on the authors' benchmarks. The design uses structural reparameterization to separate spatial and channel operations while allowing task-specific adjustments. If the results hold, it demonstrates that purpose-built OCR pipelines can deliver better efficiency and accuracy than scaling up general-purpose models.

Core claim

The central claim is that a unified MetaFormer-style building block with structural reparameterization, applied across backbone, detection neck, and recognition neck with task-specific strides, enables PP-OCRv6 models to achieve 83.2% recognition accuracy and 86.2% detection Hmean on in-house benchmarks, surpassing PP-OCRv5 and large VLMs like Qwen3-VL-235B with orders of magnitude fewer parameters.

What carries the argument

Unified MetaFormer-style building block with structural reparameterization that decouples spatial token mixing from channel mixing.
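
As a rough illustration of what structural reparameterization means in general (in the RepVGG sense), the sketch below folds three parallel branches, a 3-tap convolution, a 1-tap convolution, and an identity shortcut, into a single 3-tap kernel that produces the same output at inference time. It is a 1-D, single-channel toy in plain Python. The function names and kernel weights are hypothetical, and this is not the PP-OCRv6 block, whose exact layout is not given in the material summarized here.

```python
def conv1d_same(x, kernel):
    """Cross-correlate x with an odd-length kernel, zero-padded to keep the length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(x))
    ]


def multi_branch(x, k3, k1):
    """Training-time form: 3-tap branch + 1-tap branch + identity shortcut."""
    a = conv1d_same(x, k3)
    b = conv1d_same(x, k1)
    return [ai + bi + xi for ai, bi, xi in zip(a, b, x)]


def reparameterize(k3, k1):
    """Fold the 1-tap kernel and the identity into the centre tap of the 3-tap kernel."""
    merged = list(k3)
    merged[1] += k1[0] + 1.0  # 1-tap weight plus the identity (weight 1)
    return merged


x = [1.0, 2.0, -1.0, 0.5]
k3 = [0.5, -1.0, 0.25]  # hypothetical weights
k1 = [2.0]              # hypothetical weight

merged = reparameterize(k3, k1)
train_out = multi_branch(x, k3, k1)   # three branches, as used during training
infer_out = conv1d_same(x, merged)    # one kernel, as deployed
```

Because convolution is linear, the two forms agree on every input, so the extra branches cost nothing at inference. Real implementations also fold batch normalization into the kernel and bias, which this toy omits.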

If this is right

PP-OCRv6_medium outperforms PP-OCRv5_server by 5.1 percentage points in recognition accuracy.
PP-OCRv6_medium outperforms PP-OCRv5_server by 4.6 percentage points in detection Hmean.
The tiny model variant runs 3.9 times faster than PP-OCRv5_mobile on Intel Xeon CPU at comparable accuracy.
PP-OCRv6_medium surpasses Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro on the authors' in-house OCR benchmarks despite using orders of magnitude fewer parameters.
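
Two pieces of arithmetic sit behind these bullets. Detection Hmean is the harmonic mean of precision and recall, and the stated margins, read as absolute percentage points, imply PP-OCRv5_server scores of about 78.1% recognition accuracy and 81.6% Hmean on the same in-house benchmarks. The sketch below works both out; the helper names are hypothetical, and the precision/recall pair is made up purely for illustration, since no such values are reported here.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (the detection F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_baseline(new_score, gain_pp):
    """Baseline score implied by a new score and a gain in percentage points."""
    return round(new_score - gain_pp, 1)


# Reported PP-OCRv6_medium scores and margins over PP-OCRv5_server.
rec_baseline = implied_baseline(83.2, 5.1)  # recognition accuracy, %
det_baseline = implied_baseline(86.2, 4.6)  # detection Hmean, %

# Hypothetical precision/recall pair, only to show how Hmean behaves.
example = hmean(0.90, 0.80)
```

Hmean is pulled toward the lower of its two inputs, so a detector cannot reach a high Hmean by trading away recall for precision or the reverse.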

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that architectural specialization for OCR can be more effective than increasing model scale in general VLMs.
Deployment on edge devices becomes more feasible with the tiny and small tiers.
Similar block designs could be adapted for other specialized vision tasks beyond OCR.

Load-bearing premise

That the in-house benchmarks accurately represent real-world OCR performance, and that the VLM comparisons were run under matching evaluation conditions.

What would settle it

Running the PP-OCRv6 models on widely used public OCR datasets such as ICDAR 2015 or Total-Text and comparing results directly to the cited VLMs under identical protocols.
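
One way to picture "identical protocols" is a harness that applies the same text normalization and the same exact-match scoring to every system's output. The sketch below is a toy under stated assumptions: the normalization rule, the sample strings, and the system names are all hypothetical, and the paper's actual metric definition may differ.

```python
import unicodedata


def normalize(text):
    """Shared normalization: NFKC, then collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def line_accuracy(predictions, references):
    """Fraction of lines whose normalized prediction equals the normalized reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical ground truth and system outputs.
references = ["Total 42.00", "Invoice No. 7", "EXIT"]
outputs = {
    "ocr_pipeline": ["Total 42.00", "Invoice No. 7", "EX1T"],
    "vlm": ["Total  42.00", "Invoice No 7", "EXIT"],
}

# Every system goes through the same normalize-and-compare path.
scores = {name: line_accuracy(preds, references) for name, preds in outputs.items()}
```

The point of the design is that no system gets its own scoring path: a difference in scores then reflects the outputs rather than the post-processing.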

Original abstract

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9$\times$ faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PP-OCRv6 adds a reusable MetaFormer block and reparameterization to the prior series, but the headline claim of beating large VLMs rests entirely on private benchmarks whose construction and VLM evaluation protocol are not described.

The paper's real contribution is the engineering detail on a single block that handles both detection and recognition through task-specific strides and structural reparameterization. This lets them ship three sizes from the same primitives and report faster CPU inference for the tiny version while holding accuracy close to the previous mobile model. That part looks like straightforward, reproducible progress on the PP-OCR line.

The central claim, however, is that the medium model beats Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro on recognition and detection. All numbers come from the authors' in-house sets, with no description of how the VLMs were prompted, how outputs were parsed, or whether the test distribution favors narrow OCR pipelines. Without those details the margins cannot be interpreted as evidence of architectural superiority.

The rest of the work is incremental: the block design is a variation on existing MetaFormer ideas, the data-centric optimizations are mentioned but not quantified separately, and no public datasets or code are referenced. The tiny model's speed claim is the only result that could be checked independently.
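
An independent check of the speed claim could look roughly like the sketch below: time each model's single-image inference after a few warm-up calls, take the median, and divide. Everything here is a stand-in. `dummy_inference` is a placeholder workload, the latencies passed to `speedup` are made-up numbers, and a real check would call both models on the same CPU, thread count, and input set.

```python
import statistics
import time


def median_latency_ms(run_once, warmup=3, repeats=20):
    """Median wall-clock latency of run_once() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def dummy_inference():
    """Placeholder workload standing in for one model forward pass."""
    return sum(i * i for i in range(1000))


latency = median_latency_ms(dummy_inference)
ratio = speedup(39.0, 10.0)  # made-up latencies in milliseconds
```

The median is used instead of the mean so that a few slow outlier runs, common on shared CPUs, do not skew the ratio.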

This is useful reading for teams that already run PP-OCR in production and want the next internal version. It does not yet supply the evidence needed for a general claim about small models versus billion-parameter VLMs. I would not bring it to a reading group or cite it until the benchmark protocol is public. A serious editor should desk-reject unless the authors add the missing evaluation details.

Referee Report

1 major / 0 minor

Summary. The paper introduces PP-OCRv6, a family of lightweight OCR models (medium, small, tiny) built around a unified MetaFormer-style block with structural reparameterization that decouples spatial and channel mixing and uses task-specific strides. On in-house benchmarks, PP-OCRv6_medium is reported to reach 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by 5.1 and 4.6 percentage points, respectively, while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny variant is also claimed to be 3.9× faster than PP-OCRv5_mobile on an Intel Xeon CPU.

Significance. If the performance margins can be shown to arise from the architectural choices rather than benchmark construction, the work would establish that compact, task-specialized OCR pipelines can exceed general VLMs on OCR metrics at far lower cost. The shared MetaFormer primitives across detection/recognition and model scales, together with reparameterization, constitute a concrete engineering contribution for edge and server deployment.

major comments (1)

[Abstract] The central claim that PP-OCRv6_medium surpasses Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro rests on comparisons performed exclusively on in-house benchmarks. No information is supplied on dataset composition, train/test splits, annotation protocol, or the precise prompting and output-parsing procedures applied to the VLMs; without these details the reported +5.1% / +4.6% margins cannot be verified as reflecting model superiority rather than benchmark bias.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback. The concern about insufficient documentation of the in-house benchmarks and VLM evaluation protocol is valid, and we will revise the manuscript accordingly to improve transparency while respecting the proprietary nature of the data.

Referee: [Abstract] The central claim that PP-OCRv6_medium surpasses Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro rests on comparisons performed exclusively on in-house benchmarks. No information is supplied on dataset composition, train/test splits, annotation protocol, or the precise prompting and output-parsing procedures applied to the VLMs; without these details the reported +5.1% / +4.6% margins cannot be verified as reflecting model superiority rather than benchmark bias.

Authors: We agree that the manuscript lacks sufficient detail on the in-house benchmarks and VLM evaluation setup. In the revised version we will add a new subsection in the Experiments section that describes (i) high-level dataset composition and domain coverage, (ii) train/test split statistics and annotation guidelines, and (iii) the exact prompting templates together with the deterministic output-parsing rules applied to Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro. Full release of the raw data remains impossible for proprietary reasons, but the added protocol description will allow readers to judge whether the reported margins reflect architectural merit rather than benchmark construction.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance claims with no derivations

The paper contains no equations, derivations, or predictions that reduce to fitted inputs or self-citations. All reported results (83.2% recognition accuracy, 86.2% detection Hmean on in-house benchmarks, comparisons to VLMs) are direct empirical measurements. Architectural descriptions (MetaFormer-style blocks, reparameterization, task-specific strides) are presented as design choices without any claim that performance follows from a mathematical reduction to those choices. The in-house benchmark limitation affects verifiability but does not create circularity in any derivation chain. This is a standard empirical model paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described beyond the high-level mention of a MetaFormer-style block.

pith-pipeline@v0.9.1-grok · 5814 in / 1134 out tokens · 23803 ms · 2026-06-27T07:25:10.354675+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 2 linked inside Pith

[1]

Gpt-4 technical report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[2]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[3]

Pp-ocr: A practical ultra lightweight ocr system

Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020

arXiv 2020
[4]

Pp-ocrv2: Bag of tricks for ultra lightweight ocr system

Yuning Du, Chenxia Li, Ruoyu Guo, Cheng Cui, Weiwei Liu, Jun Zhou, Bin Lu, Yehua Yang, Qiwen Liu, Xiaoguang Hu, et al. Pp-ocrv2: Bag of tricks for ultra lightweight ocr system. arXiv preprint arXiv:2109.03144, 2021

arXiv 2021
[5]

Pp-ocrv3: More attempts for the improvement of ultra lightweight ocr system

Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, et al. Pp-ocrv3: More attempts for the improvement of ultra lightweight ocr system. arXiv preprint arXiv:2206.03001, 2022

arXiv 2022
[6]

Pp-ocrv5: A specialized 5m-parameter model rivaling billion-parameter vision-language models on ocr tasks

Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, et al. Pp-ocrv5: A specialized 5m-parameter model rivaling billion-parameter vision-language models on ocr tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2467–2476, 2026

2026
[7]

Metaformer is actually what you need for vision

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10819–10829, 2022

2022
[8]

Repvit: Revisiting mobile cnn from vit perspective

Ao Wang, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15909–15920, 2024

2024
[9]

Pp-lcnet: A lightweight cpu convolutional neural network

Cheng Cui, Tingquan Gao, Shengyu Wei, Yuning Du, Ruoyu Guo, Shuilong Dong, Bin Lu, Ying Zhou, Xueying Lv, Qiwen Liu, et al. Pp-lcnet: A lightweight cpu convolutional neural network. arXiv preprint arXiv:2109.15099, 2021

arXiv 2021
[10]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018

2018
[11]

Repvgg: Making vgg-style convnets great again

Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13733–13742, 2021

2021
[12]

Squeeze-and-excitation networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

2018
[13]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

2017
[14]

V-net: Fully convolutional neural networks for volumetric medical image segmentation

Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016

2016
[15]

Unireplknet: A universal perception large-kernel convnet for audio video point cloud time-series and image recognition

Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, and Ying Shan. Unireplknet: A universal perception large-kernel convnet for audio video point cloud time-series and image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5513–5524, 2024

2024
[16]

Deeply-supervised nets

Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial intelligence and statistics, pages 562–570. Pmlr, 2015

2015
[17]

Training region-based object detectors with online hard example mining

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 761–769, 2016

2016
[18]

Gtc: Guided training of ctc towards efficient and accurate scene text recognition

Wenyang Hu, Xiaocong Cai, Jun Hou, Shuai Yi, and Zhiping Lin. Gtc: Guided training of ctc towards efficient and accurate scene text recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11005–11012, 2020

2020
[19]

Nrtr: A no-recurrence sequence-to-sequence model for scene text recognition

Fenfen Sheng, Zhineng Chen, and Bo Xu. Nrtr: A no-recurrence sequence-to-sequence model for scene text recognition. In 2019 International conference on document analysis and recognition (ICDAR), pages 781–786. IEEE, 2019

2019

Appendix A. Language Support. Table 10 compares language coverage across PP-OCR versions. PP-OCRv3/v4 support only Simplified Chinese...
