Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
Pith reviewed 2026-05-15 01:11 UTC · model grok-4.3
The pith
Composing layout templates with diverse document elements and applying structure-focused training lets 1B-parameter MLLMs parse real-world documents more accurately and stably.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Realistic Scene Synthesis constructs large-scale, structurally diverse full-page supervision by composing layout templates with rich document elements, while the Document-Aware Training Recipe applies progressive learning and structure-token optimization to raise structural fidelity and decoding stability. Together, these enable a 1B-parameter MLLM to achieve superior accuracy and robustness on both scanned and casually captured documents.
What carries the argument
Realistic Scene Synthesis that assembles layout templates with varied document elements to produce full-page end-to-end training data, combined with Document-Aware Training that uses progressive learning stages and structure-token optimization to enforce output consistency.
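The abstract does not specify the synthesis pipeline's data format, but the template-composition idea can be sketched in a few lines. The slot names, element pool, and markdown-style target below are illustrative assumptions, not the paper's actual implementation:

```python
import random

# Hypothetical element pool; the paper's real pipeline composes rendered
# document elements (tables, formulas, paragraphs, figures) into templates.
ELEMENT_POOL = {
    "title":     ["# Quarterly Report", "# Methods"],
    "paragraph": ["Lorem ipsum dolor sit amet.", "Results were consistent."],
    "table":     ["| a | b |\n|---|---|\n| 1 | 2 |"],
}

def compose_page(template, rng):
    """Fill a layout template (an ordered list of slot types) with
    randomly drawn elements, returning the ground-truth markup string
    a full-page parser would be trained to emit for the rendered page."""
    blocks = [rng.choice(ELEMENT_POOL[slot]) for slot in template]
    return "\n\n".join(blocks)

rng = random.Random(0)
page_markup = compose_page(["title", "paragraph", "table"], rng)
```

Scaling this up with rendered images, capture noise, and many templates is what would produce the full-page supervision the paper describes; the sketch only shows how template composition yields paired structure targets for free.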
If this is right
- The resulting model produces fewer hallucinations and more structurally consistent outputs on full-page inputs.
- Performance improves simultaneously on clean scanned documents and on noisy real-world photographs.
- A single 1B-parameter MLLM becomes competitive for both digital and casually captured document parsing tasks.
- Releasing the synthesis pipeline and Wild-OmniDocBench benchmark supplies new resources for further end-to-end training research.
Where Pith is reading between the lines
- The same synthesis method could be extended to generate training pairs for other structured-output multimodal tasks such as table extraction or form understanding.
- If the distribution match holds, similar template-composition techniques might reduce the need for manual annotation in other document-related vision-language benchmarks.
- Progressive structure-token training may transfer to non-document domains where output format consistency is critical, such as code generation from images.
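"Structure-token optimization" is not defined in the abstract; one plausible reading is a token-level loss that upweights structural markup (table delimiters, headings) relative to plain text. A minimal numpy sketch under that assumption, with the weight value chosen arbitrarily:

```python
import numpy as np

def weighted_token_nll(logits, targets, structure_mask, structure_weight=2.0):
    """Token-level negative log-likelihood where tokens flagged as
    structural (e.g. table or heading markup) are upweighted.  One
    plausible reading of 'structure-token optimization'; the actual
    loss is not given in the abstract."""
    # Log-softmax over the vocabulary axis, numerically stabilized.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    weights = np.where(structure_mask, structure_weight, 1.0)
    return float((weights * nll).sum() / weights.sum())

logits = np.zeros((4, 10))            # uniform predictions over 10 tokens
targets = np.array([1, 2, 3, 4])
mask = np.array([True, False, False, True])  # tokens 0 and 3 are "structural"
loss = weighted_token_nll(logits, targets, mask)
```

With uniform logits every token costs log(10) nats, so the weighting leaves this toy value unchanged; on real predictions it shifts gradient mass toward the tokens that define document structure.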
Load-bearing premise
Composing layout templates with document elements will generate synthetic data whose noise and variability distribution closely matches that of real-world casually captured documents.
What would settle it
Training the same 1B-parameter MLLM on existing public document datasets instead of the new synthetic data and observing equal or better accuracy on the Wild-OmniDocBench real-world test set would falsify the central effectiveness claim.
Original abstract
Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a data-training co-design framework for robust end-to-end document parsing using MLLMs. It proposes Realistic Scene Synthesis to generate large-scale, structurally diverse full-page supervision data by composing layout templates with rich document elements, paired with a Document-Aware Training Recipe using progressive learning and structure-token optimization. The work also introduces the Wild-OmniDocBench benchmark derived from real-world captured documents and reports that integration into a 1B-parameter MLLM yields superior accuracy and robustness on both scanned/digital and casually captured real-world scenarios, with all models, pipelines, and benchmarks to be publicly released.
Significance. If the empirical claims are substantiated, the framework could meaningfully advance real-world document parsing by mitigating data scarcity and structural inconsistencies in MLLM outputs, offering a scalable alternative to cascaded pipelines that fail under non-standard capture conditions. The public release of synthesis code, training recipes, and the new benchmark would provide reusable resources for the community.
Major comments (2)
- [Abstract] The headline claim of 'superior accuracy and robustness' across scanned/digital and real-world scenarios is presented without quantitative metrics, baselines, error bars, or statistical detail. The claim is load-bearing for the central contribution, yet the abstract supplies no evidence for the stated improvements.
- [§3] Realistic Scene Synthesis: The procedure for composing layout templates is described as producing data whose variability and noise match real-world captured documents, yet no quantitative distributional comparison (e.g., Wasserstein distance on element positions/sizes, noise histogram overlap, or perceptual metrics) is reported between the generated images and the Wild-OmniDocBench real captures. This gap directly affects the validity of the robustness claim on casually photographed documents.
Minor comments (1)
- [Abstract] The statement that 'all models, data synthesis pipelines, and benchmarks will be publicly released' lacks a timeline, repository link, or licensing details.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our claims where appropriate.
Point-by-point responses
Referee: [Abstract] The headline claim of 'superior accuracy and robustness' across scanned/digital and real-world scenarios is presented without quantitative metrics, baselines, error bars, or statistical detail. The claim is load-bearing for the central contribution, yet the abstract supplies no evidence for the stated improvements.
Authors: We agree that the abstract would benefit from explicit quantitative support for the headline claim. In the revised version, we will update the abstract to include key metrics such as the absolute accuracy improvements (e.g., +X% on Wild-OmniDocBench and +Y% on digital benchmarks) relative to strong baselines, along with a brief note on the evaluation protocol and error analysis detailed in the main body and tables. revision: yes
Referee: [§3] Realistic Scene Synthesis: The procedure for composing layout templates is described as producing data whose variability and noise match real-world captured documents, yet no quantitative distributional comparison (e.g., Wasserstein distance on element positions/sizes, noise histogram overlap, or perceptual metrics) is reported between the generated images and the Wild-OmniDocBench real captures. This gap directly affects the validity of the robustness claim on casually photographed documents.
Authors: We acknowledge that a direct quantitative distributional comparison would further substantiate the synthesis procedure. While the manuscript currently validates the approach via downstream performance on the real-world Wild-OmniDocBench, we will add in the revision a quantitative analysis in §3, including Wasserstein distances on element position/size distributions, noise histogram overlaps, and perceptual metrics (e.g., LPIPS or FID) between the synthetic data and real captures from the benchmark. revision: yes
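The promised comparison is straightforward to run once element annotations exist on both sides. For equal-size 1-D samples the empirical Wasserstein-1 distance reduces to the mean absolute difference of sorted values (the optimal 1-D coupling is monotone); the position data below are synthetic stand-ins, not the paper's distributions:

```python
import numpy as np

def w1_distance(x, y):
    """Empirical 1-D Wasserstein-1 distance between two equal-size
    samples: mean absolute difference of the sorted samples."""
    assert len(x) == len(y)
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
# Stand-ins for normalized element y-positions; the real comparison would
# use layout annotations from synthetic pages and Wild-OmniDocBench scans.
synthetic_pos = rng.uniform(0.0, 1.0, size=1000)
real_pos      = rng.beta(2.0, 2.0, size=1000)   # mildly center-biased

gap = w1_distance(synthetic_pos, real_pos)
```

A small `gap` relative to the within-benchmark spread would support the distribution-match premise; element sizes, aspect ratios, and per-page noise statistics can be compared the same way, one marginal at a time.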
Circularity Check
No circularity; empirical synthesis and benchmark evaluation
Full rationale
The manuscript describes a data-training co-design: Realistic Scene Synthesis via template composition with document elements, plus a Document-Aware Training Recipe using progressive learning and structure-token optimization. No equations, parameter fittings, or derivations appear. Claims rest on training a 1B-parameter MLLM and reporting accuracy on scanned/digital data plus the newly introduced Wild-OmniDocBench (real captures). No self-citation is invoked to justify uniqueness or to close a definitional loop; the distributional match between synthetic and real data is asserted but not reduced to a fitted quantity or self-referential definition. The claims are therefore tested against external benchmarks rather than being tautological.
Forward citations
Cited by 1 Pith paper
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.