An LMM for Precisely Grounding Elements in Documents
Pith reviewed 2026-06-26 01:49 UTC · model grok-4.3
The pith
PreciseDoc is an LMM that grounds document elements precisely by training on synthetic hand-filled documents with coordinate metadata and using reinforcement learning to jointly supervise grounding and reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PreciseDoc is an LMM specifically designed for precise element grounding in documents. It uses two synthetic data generation pipelines to create high-quality training examples with fine-grained coordinates, including hand-filled documents with camera effects. A training paradigm jointly supervises grounding and reasoning with reinforcement learning. This results in capabilities like locating personal information from CVs and better performance on document spatial grounding and understanding benchmarks.
What carries the argument
Synthetic data generation pipelines that produce documents paired with fine-grained coordinate metadata, combined with a reinforcement learning training paradigm that jointly optimizes grounding and reasoning.
If this is right
- The model develops real-world functions such as locating personal information from CVs beyond single-text localization.
- Joint reinforcement learning supervision increases the contribution of grounded evidence to overall reasoning.
- The approach enables further optimization specifically for Document VQA tasks.
- Comprehensive benchmark evaluations demonstrate advantages in both document spatial grounding and document understanding.
Where Pith is reading between the lines
- Accurate grounding could directly support document error detection by supplying reliable element locations for verification.
- Synthetic data pipelines may address scarcity issues in training data for other multimodal grounding problems.
- Joint grounding-reasoning training might reduce reliance on ungrounded inferences in document-based chains of reasoning.
- The method could integrate into production document processing pipelines where precise spatial references matter.
Load-bearing premise
The synthetic data generation pipelines produce document images whose distribution is close enough to real-world cases that benchmark gains transfer to actual use.
What would settle it
Evaluating the model on a large set of real-world scanned documents with manual ground-truth coordinates and comparing precision to existing LMMs; if no improvement or degradation occurs, the claim fails.
Figures
read the original abstract
Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise element grounding and can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data by two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model develops more real-world functions beyond straightforward localization of single text, such as locating personal information from CVs. Furthermore, we introduce a training paradigm for visual grounded reasoning where the grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PreciseDoc, an LMM for precise element grounding in document images. It constructs training data via two synthetic pipelines (hand-filled documents plus camera effects) that produce paired fine-grained coordinate metadata, and proposes a reinforcement-learning training paradigm that jointly supervises grounding and reasoning. The central claim is that these data and methods yield advantages in document spatial grounding and document understanding, as shown by comprehensive evaluation on various benchmarks.
Significance. If the reported benchmark gains are real and the synthetic data generalizes, the work would address a recognized weakness of current LMMs on text-rich documents and supply a scalable route to high-precision localization metadata. The joint RL supervision of grounding and reasoning is a plausible direction for improving evidence-based reasoning.
major comments (2)
- [Abstract] Abstract: the assertion that 'a comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods' supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the central empirical claim cannot be evaluated.
- [Abstract] Abstract (synthetic data pipelines): no distribution-distance metric, realism score, or ablation on real vs. synthetic test splits is reported to verify that the hand-filled documents plus camera effects reproduce the statistics of the real document images appearing in the held-out benchmarks. This is load-bearing for the transfer claim.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the abstract. We will revise the abstract to better support the central claims with quantitative highlights from the full manuscript while preserving its concise nature. The full paper already contains the requested evaluations, baselines, and ablations in the experiments section.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that 'a comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods' supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the central empirical claim cannot be evaluated.
Authors: We agree the abstract would be strengthened by including key quantitative results. The manuscript body reports comprehensive evaluations across multiple benchmarks, including specific metrics (e.g., grounding precision, VQA accuracy), baseline comparisons, ablations on data pipelines and RL training, and error analyses. We will revise the abstract to incorporate representative numbers, baseline references, and mention of ablations to make the empirical claim self-contained at the summary level. revision: yes
-
Referee: [Abstract] Abstract (synthetic data pipelines): no distribution-distance metric, realism score, or ablation on real vs. synthetic test splits is reported to verify that the hand-filled documents plus camera effects reproduce the statistics of the real document images appearing in the held-out benchmarks. This is load-bearing for the transfer claim.
Authors: The synthetic pipelines are explicitly designed to produce realistic variations via hand-filled content and camera effects to bridge the domain gap. While the current manuscript does not report explicit distribution-distance metrics (e.g., FID) or dedicated real-vs-synthetic ablations, the transfer is validated empirically through strong performance gains on held-out real-world benchmarks. We will add a brief discussion in the revised manuscript on the design rationale for realism and note the empirical evidence of generalization; a full quantitative distribution analysis would require additional experiments beyond the current scope. revision: partial
Circularity Check
No circularity: empirical modeling contribution with independent evaluation
full rationale
The paper describes construction of synthetic training data via two pipelines, a joint grounding+reasoning training paradigm using reinforcement learning, and benchmark evaluations. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described content. Claims rest on experimental results rather than any derivation chain that reduces to its own inputs by construction. This is a standard empirical LMM paper; the reader's assessment of score 2 is consistent with minor self-citation tolerance but 0 is appropriate here as no circular patterns are exhibited.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Synthetic hand-filled documents with camera effects approximate the distribution of real document images
- domain assumption Joint reinforcement learning of grounding and reasoning improves the contribution of grounded evidence
Reference graph
Works this paper leans on
-
[1]
Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground- r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025
-
[2]
Revisiting referring expression comprehension evaluation in the era of large multimodal models
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 513–524, 2025
2025
-
[3]
How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024
2024
-
[4]
PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026. URLhttps://arxiv.org/abs/2601.21957
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[5]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd.Advances in Neural Information Processing Systems, 37:42566–42592, 2024
2024
-
[6]
GRIT: Teaching MLLMs to Think with Images
Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models.arXiv preprint arXiv:2407.01449, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding.arXiv preprint arXiv:2308.11592, 2023
-
[9]
Boundingdocs: a unified dataset for document question answering with spatial annotations: S
Simone Giovannini, Fabio Coppini, Andrea Gemelli, and Simone Marinai. Boundingdocs: a unified dataset for document question answering with spatial annotations: S. giovannini et al. International Journal on Document Analysis and Recognition (IJDAR), pages 1–16, 2025
2025
-
[10]
Nist special database 19 handprinted forms and characters database
Patrick Grother. Nist special database 19 handprinted forms and characters database. 1995. URLhttps://api.semanticscholar.org/CorpusID:59785963
1995
-
[11]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
mplug-docowl 1.5: Unified structure learning for ocr-free document understanding
Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 3096–3120, 2024
2024
-
[14]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 11
2014
-
[15]
H. W. Kuhn. The hungarian method for the assignment problem.Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. doi: https://doi.org/10.1002/nav.3800020109. URL https: //onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109
-
[16]
Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernon- court, Wanrong Zhu, Tianyi Zhou, and Tong Sun. Towards visual text grounding of multimodal large language model.arXiv preprint arXiv:2504.04974, 2025
-
[17]
Monkey: Image resolution and text label are important things for large multi-modal models
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26763–26773, 2024
2024
-
[18]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
2023
-
[19]
Textmon- key: An ocr-free large multimodal model for understanding document.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmon- key: An ocr-free large multimodal model for understanding document.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
2026
-
[20]
Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Vision- reasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025
-
[21]
Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024
-
[22]
Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning
Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 18798–18806, 2024
2024
-
[23]
Georgios Pantazopoulos and Eda B Özyi˘git. Towards understanding visual grounding in visual language models.arXiv preprint arXiv:2509.10345, 2025
-
[24]
Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes.arXiv preprint arXiv:2505.23179, 2025
-
[25]
Cogcom: Train large vision-language models diving into details through chain of manipulations
Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, et al. Cogcom: Train large vision-language models diving into details through chain of manipulations. 2024
2024
-
[26]
Referring expression comprehension: A survey of methods and datasets.IEEE Transactions on Multimedia, 23:4426–4440, 2020
Yanyuan Qiao, Chaorui Deng, and Qi Wu. Referring expression comprehension: A survey of methods and datasets.IEEE Transactions on Multimedia, 23:4426–4440, 2020
2020
-
[27]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5
2026
-
[28]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024
2024
-
[29]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
MIT press Cambridge, 1998
Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
1998
-
[31]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 12
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
MinerU: An Open-Source Solution for Precise Document Content Extraction
Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction.arXiv preprint arXiv:2409.18839, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025
-
[34]
VGR: Visual Grounded Reasoning
Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Ureader: Universal ocr-free visually-situated language understand- ing with multimodal large language model
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understand- ing with multimodal large language model. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, 2023
2023
-
[36]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Modeling context in referring expressions
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. InEuropean conference on computer vision, pages 69–85. Springer, 2016
2016
-
[38]
Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, and Jizhou Huang. Bbox docvqa: A large scale bounding box grounded dataset for enhancing reasoning in document visual question answer.arXiv preprint arXiv:2511.15090, 2025
-
[39]
Referring expression comprehension with semantic visual relationship and word mapping
Chao Zhang, Weiming Li, Wanli Ouyang, Qiang Wang, Woo-Shik Kim, and Sunghoon Hong. Referring expression comprehension with semantic visual relationship and word mapping. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1258–1266, 2019
2019
-
[40]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Dogr: Towards versatile visual document grounding and referring
Yinan Zhou, Yuxin Chen, Haokun Lin, Yichen Wu, Shuyu Yang, Zhongang Qi, Chen Ma, and Li Zhu. Dogr: Towards versatile visual document grounding and referring. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3596–3606, 2025
2025
-
[42]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13 A Training Details The base model we use is GLM-4.6V-9B [12]. The whole training process consist...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.