Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3
The pith
PaddleOCR-VL filters redundant image regions with a lightweight module so a 0.9B vision-language model can parse documents at state-of-the-art accuracy using far fewer tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaddleOCR-VL uses a coarse-to-fine pipeline in which the Valid Region Focus Module first identifies valid vision tokens via localization and contextual prediction, then routes only those tokens to a trained 0.9B vision-language model that performs detailed document recognition without processing the entire high-resolution image.
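The two-stage flow can be sketched in a few lines. This is a minimal illustration, not the paper's actual interface: the names `scorer`, `recognizer`, and `keep_threshold` are stand-ins, with `scorer` playing the role of the Valid Region Focus Module.

```python
# Hypothetical sketch of the coarse-to-fine pipeline: score every vision
# token, keep only the "valid" ones, and hand the reduced set to the
# recognition model. All names are illustrative assumptions.
import numpy as np

def coarse_to_fine_parse(patch_tokens, scorer, recognizer, keep_threshold=0.5):
    """Route only 'valid' vision tokens to the recognition model.

    patch_tokens: (N, D) array of vision tokens for the full-resolution page.
    scorer:       lightweight callable returning one validity score per token
                  (stand-in for the Valid Region Focus Module).
    recognizer:   compact model that decodes structure from the kept tokens.
    """
    scores = scorer(patch_tokens)           # (N,) validity scores
    keep = scores >= keep_threshold         # boolean mask over tokens
    valid_tokens = patch_tokens[keep]       # background tokens dropped here
    return recognizer(valid_tokens), float(keep.mean())
```

With a scorer that flags only foreground patches, the recognizer sees a fraction of the original token budget, which is the source of the claimed efficiency gains.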
What carries the argument
The Valid Region Focus Module (VRFM), a lightweight network that selects semantically relevant vision tokens from document images before they reach the recognition model.
If this is right
- Page-level parsing and element-level recognition both reach state-of-the-art scores.
- The system uses substantially fewer vision tokens and parameters than competing vision-language models.
- Inference speed improves while maintaining or exceeding accuracy of larger models.
- Targeted filtering of visual input proves effective for fine-grained document tasks.
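The token-budget argument behind these expectations is quadratic scaling: under simple patchification, vision-token count grows with the square of the image side length, so dropping a fixed fraction of redundant tokens saves more absolute compute at higher resolutions. The patch size and filtering ratio below are assumed for illustration, not taken from the paper.

```python
# Back-of-envelope illustration with assumed numbers (14px patches, ~60%
# of a page being background); not figures from the paper.
def vision_tokens(side_px, patch_px=14):
    """Token count for a square image under simple patchification."""
    per_side = side_px // patch_px
    return per_side * per_side

for side in (448, 896, 1792):
    total = vision_tokens(side)
    kept = int(total * 0.4)  # keep ~40% if ~60% of the page is background
    print(f"{side}px: {total} tokens -> {kept} after filtering")
```

Doubling the side length quadruples the token count, so the same filtering ratio removes four times as many tokens.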
Where Pith is reading between the lines
- The same filtering step could be applied to other high-resolution vision tasks to reduce token budgets without retraining the main model.
- Small domain-specific models guided by coarse selection may close the gap with much larger general-purpose models in structured visual domains.
- Further gains are possible if the Valid Region Focus Module is jointly trained end-to-end with the recognition model rather than used as a fixed preprocessor.
Load-bearing premise
Redundant background regions in documents can be identified reliably enough that the small model still recovers full accuracy from the remaining tokens.
What would settle it
An ablation on a held-out document set: accuracy should drop sharply when the Valid Region Focus Module is forced to discard regions that contain critical text or layout elements, and hold steady when only background is removed.
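That settling experiment can be sketched as a forced-discard ablation; every name here is hypothetical, since the manuscript does not report such a harness.

```python
# Hypothetical settling experiment: compare mean accuracy under normal
# filtering against mean accuracy when annotated critical regions are
# forcibly discarded before recognition. A large positive gap supports
# the load-bearing premise; a near-zero gap would undermine it.
def forced_discard_gap(model, pages):
    """model(page, drop_critical=...) -> per-page accuracy in [0, 1]."""
    base = sum(model(p, drop_critical=False) for p in pages) / len(pages)
    ablated = sum(model(p, drop_critical=True) for p in pages) / len(pages)
    return base - ablated
```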
Original abstract
Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaddleOCR-VL, a coarse-to-fine document parsing architecture consisting of a lightweight Valid Region Focus Module (VRFM) that identifies semantically relevant image regions via localization and contextual prediction, followed by a compact 0.9B vision-language model (PaddleOCR-VL-0.9B) that performs detailed recognition on the filtered token set. It claims state-of-the-art results on page-level parsing and element-level recognition tasks, with substantial gains in inference speed and reductions in vision tokens and parameters relative to existing solutions and larger VLMs.
Significance. If the central performance claims hold under rigorous verification, the work offers a practical advance in efficient high-resolution document understanding by mitigating quadratic token scaling. The public release of code and models at https://github.com/PaddlePaddle/PaddleOCR further supports reproducibility and potential adoption in real-world OCR pipelines.
major comments (2)
- [Valid Region Focus Module and Experiments] The headline SOTA and efficiency claims rest on the untested assumption that VRFM correctly retains all semantically critical tokens (small text, tables, signatures, low-contrast elements) while discarding only redundant background. The manuscript reports only aggregate page- and element-level metrics; it does not quantify recall on critical sub-regions or present failure cases on dense or low-contrast layouts where filtering could cause irrecoverable accuracy drops.
- [Abstract and Results] The abstract asserts SOTA performance and strong competitiveness against top-tier VLMs without providing any quantitative metrics, baseline tables, or ablation results. This omission makes it impossible to assess the magnitude of improvement or rule out post-hoc tuning, and the full experimental section must supply these details for the central claim to be verifiable.
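One way the requested sub-region check could be operationalized is token-level recall over annotated critical regions. This is a hypothetical metric for illustration, not one reported in the manuscript.

```python
# Hypothetical check: what fraction of tokens inside annotated critical
# regions (small text, tables, signatures) did the filter retain?
import numpy as np

def critical_token_recall(selected_mask, critical_mask):
    """Recall of kept tokens over critical-region tokens.

    selected_mask: boolean (H, W) grid of tokens the filter kept.
    critical_mask: boolean (H, W) grid marking critical content.
    """
    critical = critical_mask.sum()
    if critical == 0:
        return 1.0  # nothing critical on this page, so nothing to miss
    return float((selected_mask & critical_mask).sum() / critical)
```

A recall well below 1.0 on dense or low-contrast layouts would be exactly the failure mode the referee is asking the authors to quantify.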
minor comments (2)
- [Method] Notation for the VRFM output (filtered token set) and its integration with the 0.9B model could be clarified with an explicit diagram or pseudocode to improve reproducibility.
- [Discussion] The paper should include a limitations section discussing scenarios where VRFM filtering might degrade performance (e.g., highly cluttered documents).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: [Valid Region Focus Module and Experiments] The headline SOTA and efficiency claims rest on the untested assumption that VRFM correctly retains all semantically critical tokens (small text, tables, signatures, low-contrast elements) while discarding only redundant background. The manuscript reports only aggregate page- and element-level metrics; it does not quantify recall on critical sub-regions or present failure cases on dense or low-contrast layouts where filtering could cause irrecoverable accuracy drops.
  Authors: We agree that granular validation of VRFM is valuable. While the reported aggregate metrics on page- and element-level tasks already demonstrate strong end-to-end performance, we will add in the revision: quantitative recall measurements for critical sub-regions (small text, tables, signatures, low-contrast elements) and an analysis of failure cases on dense or low-contrast layouts. These additions will directly address the concern about potential irrecoverable accuracy drops. revision: yes
- Referee: [Abstract and Results] The abstract asserts SOTA performance and strong competitiveness against top-tier VLMs without providing any quantitative metrics, baseline tables, or ablation results. This omission makes it impossible to assess the magnitude of improvement or rule out post-hoc tuning, and the full experimental section must supply these details for the central claim to be verifiable.
  Authors: We acknowledge that the abstract currently lacks specific numbers. In the revised manuscript we will update the abstract to include key quantitative results (accuracy gains, token reduction, parameter efficiency) together with explicit references to the baseline tables and ablations already present in the experimental section. This will make the magnitude of improvement directly verifiable from the abstract. revision: yes
Circularity Check
No circularity in empirical architecture derivation
Full rationale
The paper presents PaddleOCR-VL as an empirical coarse-to-fine architecture: a lightweight VRFM identifies valid tokens via localization and context prediction, after which a separately trained 0.9B VLM performs recognition on the filtered input. No equations, fitted-parameter predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Performance claims rest on external benchmarks and experiments rather than any internal reduction to the method's own inputs. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: High-resolution document images contain substantial redundant visual regions, such as background, that can be safely suppressed.
invented entities (2)
- Valid Region Focus Module (VRFM): no independent evidence
- PaddleOCR-VL-0.9B: no independent evidence
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [4] Baidu-ERNIE-Team. ERNIE 4.5 technical report, 2025.
- [5] Leilani Battle, Peitong Duan, Zachery Miranda, Dana Mukusheva, Remco Chang, and Michael Stonebraker. Beagle: Automated extraction and interpretation of visualizations from the web. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–8, 2018.
- [6] chatdoc-com. OCRFlux. https://github.com/chatdoc-com/OCRFlux, 2025. Accessed: 2025-09-25.
- [7] Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. OneChart: Purify the chart structural extraction via one auxiliary token. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 147–155, 2024.
- [8] Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, and Chi Zhang. DianJin-OCR-R1: Enhancing OCR capabilities via a reasoning-and-tool interleaved vision-language model. arXiv preprint arXiv:2508.13238, 2025.
- [9] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025.
- [10] Kenny Davila, Bhargava Urala Kota, Srirangaraj Setlur, Venu Govindaraju, Christopher Tensmeyer, Sumit Shekhar, and Ritwick Chaudhry. ICDAR 2019 competition on harvesting raw tables from infographics (Chart-Infographics). In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1594–1599. IEEE, 2019.
- [11] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.
- [12] Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, and Yu-Gang Jiang. UniRec-0.1B: Unified text and formula recognition with 0.1B parameters. arXiv preprint arXiv:2512.21095, 2025.
- [13] Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv preprint arXiv:2505.14059, 2025.
- [14] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024.
- [15] Philippe Gervais, Anastasiia Fadeeva, and Andrii Maksai. MathWriting: A dataset for handwritten mathematical expression recognition. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5459–5469, 2025.
- [16] Google DeepMind. Gemini 2.5. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025.
- [17] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- [18] Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen, and Xuguang Lan. Relation DETR: Exploring explicit position relation prior for object detection. In European Conference on Computer Vision, pages 89–105. Springer, 2024.
- [19] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2018.
- [20] Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. Chart-to-text: A large-scale benchmark for chart summarization. arXiv preprint arXiv:2203.06486, 2022.
- [21] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, 1966.
- [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- [23] Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025.
- [24] Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. CASIA online and offline Chinese handwriting databases. In 2011 International Conference on Document Analysis and Recognition, pages 37–41. IEEE, 2011.
- [25] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022.
- [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [27] Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, et al. POINTS-Reader: Distillation-free adaptation of vision-language models for document conversion. arXiv preprint arXiv:2509.01215, 2025.
- [28] Lukas Blecher. pix2tex - LaTeX OCR. https://github.com/lukas-blecher/LaTeX-OCR, 2022. Accessed: 2025-06-23.
- [29] Junyu Luo, Zekun Li, Jinpeng Wang, and Chin-Yew Lin. ChartOCR: Data extraction from charts images via a deep hybrid framework. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1917–1925, 2021.
- [30] Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, and Peter Staar. Optimized table tokenization for table structure recognition. In International Conference on Document Analysis and Recognition, pages 37–50. Springer, 2023.
- [31] Mahshad Mahdavi, Richard Zanibbi, Harold Mouchere, Christian Viard-Gaudin, and Utpal Garain. ICDAR 2019 CROHME+TFD: Competition on recognition of handwritten mathematical expressions and typeset formula detection. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1533–1538. IEEE, 2019.
- [32] Souvik Mandal, Ashish Talewar, Paras Ahuja, and Prathamesh Juvatkar. Nanonets-OCR-s: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging, 2025.
- [33] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- [34] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761, 2023.
- [35] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.
- [36] Mistral AI Team. Mistral-OCR. https://mistral.ai/news/mistral-ocr?utm_source=ai-bot.cn, 2025.
- [37] Harold Mouchere, Christian Viard-Gaudin, Richard Zanibbi, and Utpal Garain. ICFHR 2014 competition on recognition of on-line handwritten mathematical expressions (CROHME 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 791–796. IEEE, 2014.
- [38] Harold Mouchère, Christian Viard-Gaudin, Richard Zanibbi, and Utpal Garain. ICFHR2016 CROHME: Competition on recognition of online handwritten mathematical expressions. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 607–612. IEEE, 2016.
- [39]
- [40] opendatalab. MinerU2.0-2505-0.9B. https://huggingface.co/opendatalab/MinerU2.0-2505-0.9B, 2025.
- [41] Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025.
- [42] PaddlePaddle Authors. ERNIEKit. https://github.com/PaddlePaddle/ERNIE, 2025.
- [43] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- [44] Vik Paruchuri. Marker. https://github.com/datalab-to/marker, 2025. Accessed: 2025-09-25.
- [45] Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443, 2025.
- [46] rednote-hilab. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025.
- [47] Ting Sun, Cheng Cui, Yuning Du, and Yi Liu. PP-DocLayout: A unified document layout detection model to accelerate large-scale data construction. arXiv preprint arXiv:2503.17213, 2025.
- [48] Benny J Tang, Angie Boggust, and Arvind Satyanarayan. VisText: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023.
- [49] Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai Keye-VL technical report. arXiv preprint arXiv:2507.01949, 2025.
- [50] Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. UniMERNet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254, 2024.
- [51] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.
- [52] Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19681–19690, 2025.
- [53] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [54] Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024.
- [55] Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.
- [56] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [57] Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, and Xiang Bai. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4553–4562, 2022.
- [58] Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, and Lianwen Jin. DocKylin: A large multimodal model for visual document understanding with efficient visual slimming. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9923–9932, 2025.
- [59] Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning. arXiv preprint arXiv:2404.16635, 2024.
- [60] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems, 31, 2018.
- [61] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024.
- [62] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation, 2020.
- [63] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
discussion (0)