An LMM for Precisely Grounding Elements in Documents

Chuangxin Zhao; Ji Qi; Juanzi Li; Kai Sun; Lei Hou; Yijian Lu

arxiv: 2606.24118 · v1 · pith:4FBPJ5J2new · submitted 2026-06-23 · 💻 cs.CV

An LMM for Precisely Grounding Elements in Documents

Yijian Lu , Chuangxin Zhao , Kai Sun , Lei Hou , Juanzi Li , Ji Qi This is my paper

Pith reviewed 2026-06-26 01:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords document element groundinglarge multimodal modelssynthetic datavisual groundingreinforcement learningdocument VQAspatial groundingelement localization

0 comments

The pith

PreciseDoc is an LMM that grounds document elements precisely by training on synthetic hand-filled documents with coordinate metadata and using reinforcement learning to jointly supervise grounding and reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PreciseDoc is presented as an LMM tailored for precise element grounding in text-rich documents, where existing models often fail to locate critical elements accurately. The approach relies on generating large amounts of synthetic documents with exact coordinate metadata through pipelines that simulate hand-filled forms and camera distortions. Training incorporates a reinforcement learning paradigm that supervises both grounding and reasoning together to make better use of located evidence. Evaluations on benchmarks show improvements in both grounding precision and downstream document understanding. This matters for applications requiring reliable localization such as research and error detection in documents.

Core claim

PreciseDoc is an LMM specifically designed for precise element grounding in documents. It uses two synthetic data generation pipelines to create high-quality training examples with fine-grained coordinates, including hand-filled documents with camera effects. A training paradigm jointly supervises grounding and reasoning with reinforcement learning. This results in capabilities like locating personal information from CVs and better performance on document spatial grounding and understanding benchmarks.

What carries the argument

Synthetic data generation pipelines that produce documents paired with fine-grained coordinate metadata, combined with a reinforcement learning training paradigm that jointly optimizes grounding and reasoning.

If this is right

The model develops real-world functions such as locating personal information from CVs beyond single-text localization.
Joint reinforcement learning supervision increases the contribution of grounded evidence to overall reasoning.
The approach enables further optimization specifically for Document VQA tasks.
Comprehensive benchmark evaluations demonstrate advantages in both document spatial grounding and document understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Accurate grounding could directly support document error detection by supplying reliable element locations for verification.
Synthetic data pipelines may address scarcity issues in training data for other multimodal grounding problems.
Joint grounding-reasoning training might reduce reliance on ungrounded inferences in document-based chains of reasoning.
The method could integrate into production document processing pipelines where precise spatial references matter.

Load-bearing premise

The synthetic data generation pipelines produce document images whose distribution is close enough to real-world cases that benchmark gains transfer to actual use.

What would settle it

Evaluating the model on a large set of real-world scanned documents with manual ground-truth coordinates and comparing precision to existing LMMs; if no improvement or degradation occurs, the claim fails.

Figures

Figures reproduced from arXiv: 2606.24118 by Chuangxin Zhao, Ji Qi, Juanzi Li, Kai Sun, Lei Hou, Yijian Lu.

read the original abstract

Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise element grounding and can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data by two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model develops more real-world functions beyond straightforward localization of single text, such as locating personal information from CVs. Furthermore, we introduce a training paradigm for visual grounded reasoning where the grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PreciseDoc adds synthetic data pipelines and RL joint training for document grounding, but the abstract shows no numbers and no check that the synthetic images match real benchmark distributions.

read the letter

The punchline on this one is that the authors built PreciseDoc by generating lots of synthetic documents with exact coordinate labels through two pipelines—one for hand-filled forms and another adding camera effects—and then used RL to train the model to ground elements while reasoning about them jointly. That's the core new bit.

It does a decent job identifying the problem that standard LMMs struggle with precise localization in dense text documents, and the approach of mass-producing paired data plus the joint supervision makes sense for that.

The soft spots are bigger though. The abstract talks about a comprehensive evaluation showing advantages, but gives zero numbers, no specific benchmarks, no comparison to baselines, and no error analysis. That's not enough to judge if the method actually delivers. More critically, the synthetic data is the foundation, yet there's no evidence presented that these generated images have the same statistics as the real document images in the test sets. Without something like a distribution comparison or an ablation on real data, it's possible the improvements are tied to the synthetic style rather than true grounding ability. The stress-test note is on point here.

This paper is aimed at people working on document VQA and visual grounding in CV. A reader interested in practical improvements for LMMs in that area could get some ideas from the data generation and training setup.

It deserves a serious referee because the problem is real and the proposed solution is concrete enough to review, even if it will likely need more experiments to confirm the claims.

Recommendation: Send it out for review rather than desk reject.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces PreciseDoc, an LMM for precise element grounding in document images. It constructs training data via two synthetic pipelines (hand-filled documents plus camera effects) that produce paired fine-grained coordinate metadata, and proposes a reinforcement-learning training paradigm that jointly supervises grounding and reasoning. The central claim is that these data and methods yield advantages in document spatial grounding and document understanding, as shown by comprehensive evaluation on various benchmarks.

Significance. If the reported benchmark gains are real and the synthetic data generalizes, the work would address a recognized weakness of current LMMs on text-rich documents and supply a scalable route to high-precision localization metadata. The joint RL supervision of grounding and reasoning is a plausible direction for improving evidence-based reasoning.

major comments (2)

[Abstract] Abstract: the assertion that 'a comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods' supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the central empirical claim cannot be evaluated.
[Abstract] Abstract (synthetic data pipelines): no distribution-distance metric, realism score, or ablation on real vs. synthetic test splits is reported to verify that the hand-filled documents plus camera effects reproduce the statistics of the real document images appearing in the held-out benchmarks. This is load-bearing for the transfer claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We will revise the abstract to better support the central claims with quantitative highlights from the full manuscript while preserving its concise nature. The full paper already contains the requested evaluations, baselines, and ablations in the experiments section.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that 'a comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods' supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the central empirical claim cannot be evaluated.

Authors: We agree the abstract would be strengthened by including key quantitative results. The manuscript body reports comprehensive evaluations across multiple benchmarks, including specific metrics (e.g., grounding precision, VQA accuracy), baseline comparisons, ablations on data pipelines and RL training, and error analyses. We will revise the abstract to incorporate representative numbers, baseline references, and mention of ablations to make the empirical claim self-contained at the summary level. revision: yes
Referee: [Abstract] Abstract (synthetic data pipelines): no distribution-distance metric, realism score, or ablation on real vs. synthetic test splits is reported to verify that the hand-filled documents plus camera effects reproduce the statistics of the real document images appearing in the held-out benchmarks. This is load-bearing for the transfer claim.

Authors: The synthetic pipelines are explicitly designed to produce realistic variations via hand-filled content and camera effects to bridge the domain gap. While the current manuscript does not report explicit distribution-distance metrics (e.g., FID) or dedicated real-vs-synthetic ablations, the transfer is validated empirically through strong performance gains on held-out real-world benchmarks. We will add a brief discussion in the revised manuscript on the design rationale for realism and note the empirical evidence of generalization; a full quantitative distribution analysis would require additional experiments beyond the current scope. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical modeling contribution with independent evaluation

full rationale

The paper describes construction of synthetic training data via two pipelines, a joint grounding+reasoning training paradigm using reinforcement learning, and benchmark evaluations. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described content. Claims rest on experimental results rather than any derivation chain that reduces to its own inputs by construction. This is a standard empirical LMM paper; the reader's assessment of score 2 is consistent with minor self-citation tolerance but 0 is appropriate here as no circular patterns are exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that synthetic documents with camera effects match real distributions, and that joint RL supervision of grounding and reasoning yields better evidence use than separate training. No free parameters or invented physical entities are described in the abstract.

axioms (2)

domain assumption Synthetic hand-filled documents with camera effects approximate the distribution of real document images
Invoked to justify mass production of training data for real-world functions
domain assumption Joint reinforcement learning of grounding and reasoning improves the contribution of grounded evidence
Core of the proposed training paradigm

pith-pipeline@v0.9.1-grok · 5720 in / 1258 out tokens · 21453 ms · 2026-06-26T01:49:50.688349+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 22 canonical work pages · 12 internal anchors

[1]

2026 , bdsk-url-1 =

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground- r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025

work page arXiv 2025
[2]

Revisiting referring expression comprehension evaluation in the era of large multimodal models

Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 513–524, 2025

2025
[3]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

2024
[4]

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026. URLhttps://arxiv.org/abs/2601.21957

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd.Advances in Neural Information Processing Systems, 37:42566–42592, 2024

2024
[6]

GRIT: Teaching MLLMs to Think with Images

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models.arXiv preprint arXiv:2407.01449, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding.arXiv preprint arXiv:2308.11592, 2023

Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding.arXiv preprint arXiv:2308.11592, 2023

work page arXiv 2023
[9]

Boundingdocs: a unified dataset for document question answering with spatial annotations: S

Simone Giovannini, Fabio Coppini, Andrea Gemelli, and Simone Marinai. Boundingdocs: a unified dataset for document question answering with spatial annotations: S. giovannini et al. International Journal on Document Analysis and Recognition (IJDAR), pages 1–16, 2025

2025
[10]

Nist special database 19 handprinted forms and characters database

Patrick Grother. Nist special database 19 handprinted forms and characters database. 1995. URLhttps://api.semanticscholar.org/CorpusID:59785963

1995
[11]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 3096–3120, 2024

2024
[14]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 11

2014
[15]

H. W. Kuhn. The hungarian method for the assignment problem.Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. doi: https://doi.org/10.1002/nav.3800020109. URL https: //onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109

work page doi:10.1002/nav.3800020109 1955
[16]

Towards visual text grounding of multimodal large language model.arXiv preprint arXiv:2504.04974, 2025

Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernon- court, Wanrong Zhu, Tianyi Zhou, and Tong Sun. Towards visual text grounding of multimodal large language model.arXiv preprint arXiv:2504.04974, 2025

work page arXiv 2025
[17]

Monkey: Image resolution and text label are important things for large multi-modal models

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26763–26773, 2024

2024
[18]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023
[19]

Textmon- key: An ocr-free large multimodal model for understanding document.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmon- key: An ocr-free large multimodal model for understanding document.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2026
[20]

Vision- reasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Vision- reasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

work page arXiv 2025
[21]

Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966,

Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

work page arXiv 2024
[22]

Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning

Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 18798–18806, 2024

2024
[23]

Towards understanding visual grounding in visual language models.arXiv preprint arXiv:2509.10345, 2025

Georgios Pantazopoulos and Eda B Özyi˘git. Towards understanding visual grounding in visual language models.arXiv preprint arXiv:2509.10345, 2025

work page arXiv 2025
[24]

Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes.arXiv preprint arXiv:2505.23179, 2025

Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes.arXiv preprint arXiv:2505.23179, 2025

work page arXiv 2025
[25]

Cogcom: Train large vision-language models diving into details through chain of manipulations

Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, et al. Cogcom: Train large vision-language models diving into details through chain of manipulations. 2024

2024
[26]

Referring expression comprehension: A survey of methods and datasets.IEEE Transactions on Multimedia, 23:4426–4440, 2020

Yanyuan Qiao, Chaorui Deng, and Qi Wu. Referring expression comprehension: A survey of methods and datasets.IEEE Transactions on Multimedia, 23:4426–4440, 2020

2020
[27]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026
[28]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

2024
[29]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

1998
[31]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction.arXiv preprint arXiv:2409.18839, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

work page arXiv 2025
[34]

VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Ureader: Universal ocr-free visually-situated language understand- ing with multimodal large language model

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understand- ing with multimodal large language model. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, 2023

2023
[36]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Modeling context in referring expressions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. InEuropean conference on computer vision, pages 69–85. Springer, 2016

2016
[38]

Bbox docvqa: A large scale bounding box grounded dataset for enhancing reasoning in document visual question answer.arXiv preprint arXiv:2511.15090, 2025

Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, and Jizhou Huang. Bbox docvqa: A large scale bounding box grounded dataset for enhancing reasoning in document visual question answer.arXiv preprint arXiv:2511.15090, 2025

work page arXiv 2025
[39]

Referring expression comprehension with semantic visual relationship and word mapping

Chao Zhang, Weiming Li, Wanli Ouyang, Qiang Wang, Woo-Shik Kim, and Sunghoon Hong. Referring expression comprehension with semantic visual relationship and word mapping. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1258–1266, 2019

2019
[40]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Dogr: Towards versatile visual document grounding and referring

Yinan Zhou, Yuxin Chen, Haokun Lin, Yichen Wu, Shuyu Yang, Zhongang Qi, Chen Ma, and Li Zhu. Dogr: Towards versatile visual document grounding and referring. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3596–3606, 2025

2025
[42]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13 A Training Details The base model we use is GLM-4.6V-9B [12]. The whole training process consist...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

2026 , bdsk-url-1 =

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground- r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025

work page arXiv 2025

[2] [2]

Revisiting referring expression comprehension evaluation in the era of large multimodal models

Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 513–524, 2025

2025

[3] [3]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

2024

[4] [4]

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026. URLhttps://arxiv.org/abs/2601.21957

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd.Advances in Neural Information Processing Systems, 37:42566–42592, 2024

2024

[6] [6]

GRIT: Teaching MLLMs to Think with Images

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models.arXiv preprint arXiv:2407.01449, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding.arXiv preprint arXiv:2308.11592, 2023

Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding.arXiv preprint arXiv:2308.11592, 2023

work page arXiv 2023

[9] [9]

Boundingdocs: a unified dataset for document question answering with spatial annotations: S

Simone Giovannini, Fabio Coppini, Andrea Gemelli, and Simone Marinai. Boundingdocs: a unified dataset for document question answering with spatial annotations: S. giovannini et al. International Journal on Document Analysis and Recognition (IJDAR), pages 1–16, 2025

2025

[10] [10]

Nist special database 19 handprinted forms and characters database

Patrick Grother. Nist special database 19 handprinted forms and characters database. 1995. URLhttps://api.semanticscholar.org/CorpusID:59785963

1995

[11] [11]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 3096–3120, 2024

2024

[14] [14]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 11

2014

[15] [15]

H. W. Kuhn. The hungarian method for the assignment problem.Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. doi: https://doi.org/10.1002/nav.3800020109. URL https: //onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109

work page doi:10.1002/nav.3800020109 1955

[16] [16]

Towards visual text grounding of multimodal large language model.arXiv preprint arXiv:2504.04974, 2025

Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernon- court, Wanrong Zhu, Tianyi Zhou, and Tong Sun. Towards visual text grounding of multimodal large language model.arXiv preprint arXiv:2504.04974, 2025

work page arXiv 2025

[17] [17]

Monkey: Image resolution and text label are important things for large multi-modal models

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26763–26773, 2024

2024

[18] [18]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023

[19] [19]

Textmon- key: An ocr-free large multimodal model for understanding document.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmon- key: An ocr-free large multimodal model for understanding document.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2026

[20] [20]

Vision- reasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Vision- reasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

work page arXiv 2025

[21] [21]

Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966,

Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

work page arXiv 2024

[22] [22]

Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning

Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 18798–18806, 2024

2024

[23] [23]

Towards understanding visual grounding in visual language models.arXiv preprint arXiv:2509.10345, 2025

Georgios Pantazopoulos and Eda B Özyi˘git. Towards understanding visual grounding in visual language models.arXiv preprint arXiv:2509.10345, 2025

work page arXiv 2025

[24] [24]

Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes.arXiv preprint arXiv:2505.23179, 2025

Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes.arXiv preprint arXiv:2505.23179, 2025

work page arXiv 2025

[25] [25]

Cogcom: Train large vision-language models diving into details through chain of manipulations

Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, et al. Cogcom: Train large vision-language models diving into details through chain of manipulations. 2024

2024

[26] [26]

Referring expression comprehension: A survey of methods and datasets.IEEE Transactions on Multimedia, 23:4426–4440, 2020

Yanyuan Qiao, Chaorui Deng, and Qi Wu. Referring expression comprehension: A survey of methods and datasets.IEEE Transactions on Multimedia, 23:4426–4440, 2020

2020

[27] [27]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026

[28] [28]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

2024

[29] [29]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

1998

[31] [31]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction.arXiv preprint arXiv:2409.18839, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

work page arXiv 2025

[34] [34]

VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Ureader: Universal ocr-free visually-situated language understand- ing with multimodal large language model

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understand- ing with multimodal large language model. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, 2023

2023

[36] [36]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Modeling context in referring expressions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. InEuropean conference on computer vision, pages 69–85. Springer, 2016

2016

[38] [38]

Bbox docvqa: A large scale bounding box grounded dataset for enhancing reasoning in document visual question answer.arXiv preprint arXiv:2511.15090, 2025

Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, and Jizhou Huang. Bbox docvqa: A large scale bounding box grounded dataset for enhancing reasoning in document visual question answer.arXiv preprint arXiv:2511.15090, 2025

work page arXiv 2025

[39] [39]

Referring expression comprehension with semantic visual relationship and word mapping

Chao Zhang, Weiming Li, Wanli Ouyang, Qiang Wang, Woo-Shik Kim, and Sunghoon Hong. Referring expression comprehension with semantic visual relationship and word mapping. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1258–1266, 2019

2019

[40] [40]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Dogr: Towards versatile visual document grounding and referring

Yinan Zhou, Yuxin Chen, Haokun Lin, Yichen Wu, Shuyu Yang, Zhongang Qi, Chen Ma, and Li Zhu. Dogr: Towards versatile visual document grounding and referring. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3596–3606, 2025

2025

[42] [42]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13 A Training Details The base model we use is GLM-4.6V-9B [12]. The whole training process consist...

work page internal anchor Pith review Pith/arXiv arXiv 2025