pith. machine review for the scientific record.

arXiv: 2604.07419 · v1 · submitted 2026-04-08 · 💻 cs.IR

Recognition: unknown

ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3

classification 💻 cs.IR
keywords visual document retrieval · vision-language models · fine-grained alignment · ranking distribution matching · query-aware descriptions · contrastive training · region grounding

The pith

ReAlign trains visual document retrievers by matching rankings from VLM-generated query-aware region descriptions to the original query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ReAlign to improve retrieval of relevant pages from visually complex documents. It uses a strong vision-language model to locate query-related regions on each page and produce detailed, query-focused descriptions of those cropped areas. The retriever is then optimized so that the ranking order of documents produced from the original query matches the ranking order produced from these region-specific descriptions. This alignment pushes the model to attend to scattered but critical visual evidence rather than diffuse layout features. Experiments show consistent gains on multiple benchmarks for both familiar and new document collections.

Core claim

ReAlign enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. It employs a superior VLM to identify query-related regions on a page and to generate a query-aware description grounded in the cropped visual regions. The retriever is then trained with these region-focused descriptions to align the semantics of queries and visual documents, by encouraging the document ranking distribution induced by the descriptions to match the one induced by the original query.

What carries the argument

Reasoning-Guided Alignment (ReAlign), which matches the ranking distribution over documents induced by VLM-produced query-aware region descriptions to the distribution induced by the raw query.
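
A concrete reading of that objective, as a minimal PyTorch sketch. Everything here is an illustrative assumption rather than the authors' code (their implementation lives in the linked GitHub repository): the tensor shapes, the positive-page-at-index-0 convention, the temperature, and the direction of the KL term are ours; the KL-over-softmax-similarities form follows the simulated rebuttal further down.

    import torch
    import torch.nn.functional as F

    def realign_loss(q_emb, desc_emb, page_embs, tau=0.05):
        """Hypothetical ReAlign-style objective.

        q_emb:     (B, H)    query embeddings from the retriever
        desc_emb:  (B, H)    embeddings of the VLM-generated query-aware
                             region descriptions
        page_embs: (B, N, H) candidate page embeddings per query; we assume
                             the relevant page sits at index 0
        All embeddings are assumed L2-normalized.
        """
        # Similarity of each query / description to its N candidate pages.
        sim_q = torch.einsum('bh,bnh->bn', q_emb, page_embs) / tau
        sim_d = torch.einsum('bh,bnh->bn', desc_emb, page_embs) / tau

        # Ranking distributions over the candidate pages.
        p_query = F.softmax(sim_q, dim=-1)
        log_p_desc = F.log_softmax(sim_d, dim=-1)

        # Alignment term: KL(P_query || P_desc) pulls the description-induced
        # ranking toward the query-induced one. The direction (or a symmetric
        # variant) is our assumption; the abstract's wording admits either.
        align = F.kl_div(log_p_desc, p_query, reduction='batchmean')

        # Standard contrastive term against the positive page at index 0.
        labels = torch.zeros(sim_q.size(0), dtype=torch.long, device=sim_q.device)
        contrastive = F.cross_entropy(sim_q, labels)
        return contrastive + align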

If this is right

  • The method raises retrieval accuracy on both in-domain and out-of-domain visually rich document collections.
  • Performance gains hold when the underlying VLM backbone is swapped.
  • The retriever learns to focus attention on critical visual cues instead of complex layouts.
  • Up to 2 percent relative improvement is observed across standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same region-description alignment could be applied to other multimodal retrieval tasks where evidence is scattered across images or slides.
  • By outsourcing region grounding to an external VLM, the approach may reduce the number of human-labeled query-page pairs needed for effective training.
  • If the alignment step proves robust, it suggests that explicit fine-grained supervision can substitute for some of the data scale required in pure contrastive pre-training of visual retrievers.

Load-bearing premise

A stronger VLM can reliably locate the query-relevant regions on a page and generate descriptions that supply better training signals than contrastive learning on whole-page embeddings.

What would settle it

A controlled experiment that replaces the VLM region descriptions with random or non-query-aware crops would settle it: if the substitution yields equal or higher retrieval scores on the same benchmarks, the value of the alignment step is falsified.

Figures

Figures reproduced from arXiv: 2604.07419 by Ge Yu, Hao Yang, Shuo Wang, Yifan Ji, Yu Gu, Yukun Yan, Zhenghao Liu, Zhipeng Xu, Zulong Chen.

Figure 1: Illustration of Our Reasoning-Guided Alignment.
Figure 2: The Architecture of Reasoning-Guided Visual Document Retrieval (…)
Figure 3: Validation of the Quality and Diversity of Supervi…
Figure 4: Quantitative Analysis of the Learned Embedding…
Figure 6: Quantitative Analysis of the Alignment between…
Figure 7: Case Studies. Regions with higher color intensity indicate stronger attention.
Original abstract

Visual document retrieval aims to retrieve a set of document pages relevant to a query from visually rich collections. Existing methods often employ Vision-Language Models (VLMs) to encode queries and visual pages into a shared embedding space, which is then optimized via contrastive training. However, during visual document representation, localized evidence is usually scattered across complex document layouts, making it difficult for retrieval models to capture crucial cues for effective embedding learning. In this paper, we propose Reasoning-Guided Alignment (ReAlign), a method that enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. Specifically, ReAlign employs a superior VLM to identify query-related regions on a page and then generates a query-aware description grounding the cropped visual regions. The retriever is then trained using these region-focused descriptions to align the semantics between queries and visual documents by encouraging the document ranking distribution induced by the region-focused descriptions to match that induced by the original query. Experiments on diverse visually rich document retrieval benchmarks demonstrate that ReAlign consistently improves visual document retrieval performance on both in-domain and out-of-domain datasets, achieving up to 2% relative improvements. Moreover, the advantages of ReAlign generalize across different VLM backbones by guiding models to better focus their attention on critical visual cues for document representation. All code and datasets are available at https://github.com/NEUIR/ReAlign.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that ReAlign enhances visual document retrieval by using a superior VLM to identify query-related regions on document pages, generate query-aware descriptions from the cropped regions, and train the retriever to align the document ranking distributions induced by these descriptions with those induced by the original query. Experiments on diverse visually rich document retrieval benchmarks are said to show consistent improvements on both in-domain and out-of-domain datasets, with up to 2% relative gains, and the benefits generalize across VLM backbones.

Significance. If the results hold, ReAlign offers a concrete way to inject fine-grained, reasoning-based supervision into visual document embedding learning, potentially helping models focus on scattered critical cues in complex layouts rather than relying solely on whole-page contrastive training. The public release of code and datasets is a clear strength that supports reproducibility.

major comments (3)
  1. [Method] Method section: The central claim depends on the premise that VLM-generated region descriptions constitute reliably superior supervision over direct contrastive training on raw page embeddings, yet the manuscript provides no ablation removing the VLM step, no region-detection accuracy metrics, and no description-quality evaluation to test this assumption.
  2. [Experiments] Experiments section: The abstract states that consistent improvements were observed up to 2% relative, but supplies no information on baselines, statistical tests, ablation studies, or error bars; without these the support for the central claim cannot be verified.
  3. [Method] Method section: The distribution-matching objective used to align the two ranking distributions is described only at a high level; the precise loss (e.g., KL divergence, ranking loss) and its formulation are not given as an equation, which is load-bearing for understanding and reproducing the training procedure.
minor comments (1)
  1. [Abstract] Abstract: The evaluation metrics (e.g., nDCG@K, Recall@K) underlying the reported improvements are not named.
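
For concreteness, nDCG@K, one metric plausibly behind the reported numbers, can be computed as follows. This is generic textbook code for illustration, not drawn from the paper.

    import math

    def dcg_at_k(rels, k):
        # Discounted cumulative gain over the top-k results, in ranked order.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    def ndcg_at_k(rels, k):
        # rels: graded relevance of retrieved pages, best-ranked first.
        ideal = dcg_at_k(sorted(rels, reverse=True), k)
        return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

    # e.g. ndcg_at_k([1, 0, 1, 0, 0], k=5) ~= 0.92 for binary page relevance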

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical validation and methodological precision of the work. We address each major comment point by point below, indicating the revisions we will make.

Point-by-point responses
  1. Referee: [Method] Method section: The central claim depends on the premise that VLM-generated region descriptions constitute reliably superior supervision over direct contrastive training on raw page embeddings, yet the manuscript provides no ablation removing the VLM step, no region-detection accuracy metrics, and no description-quality evaluation to test this assumption.

    Authors: We agree that the manuscript would be strengthened by explicit validation of the VLM-generated supervision signals. The current experiments demonstrate end-to-end gains but do not isolate the VLM component. In the revised version, we will add an ablation study that removes the region identification and query-aware description generation steps, directly comparing ReAlign against standard contrastive training on full-page embeddings. We will also report region-detection accuracy by measuring overlap (e.g., IoU) between VLM-predicted regions and human-annotated query-relevant areas on a sampled subset of documents, along with both automatic metrics and human evaluations of description quality to substantiate the premise. revision: yes
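
A minimal sketch of the region-overlap check this response proposes, assuming axis-aligned boxes in (x1, y1, x2, y2) form. The box format and the 0.5 threshold are illustrative assumptions, not the authors' evaluation protocol.

    def iou(box_a, box_b):
        # Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes.
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # A VLM-predicted region might count as correct when, say,
    # iou(predicted_box, human_annotated_box) >= 0.5.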

  2. Referee: [Experiments] Experiments section: The abstract states that consistent improvements were observed up to 2% relative, but supplies no information on baselines, statistical tests, ablation studies, or error bars; without these the support for the central claim cannot be verified.

    Authors: The referee correctly notes that additional experimental details are needed to fully support the claims. While the manuscript already includes comparisons to multiple baselines across in-domain and out-of-domain benchmarks, we will revise the Experiments section to explicitly enumerate all baselines, report results with error bars from multiple random seeds, include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) for the observed improvements, and expand the ablation studies to cover key design choices. These changes will provide clearer verification of the up to 2% relative gains. revision: yes
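
The paired tests named here are standard; a minimal SciPy sketch follows. The score arrays are placeholder numbers for illustration only, not results from the paper.

    from scipy.stats import ttest_rel, wilcoxon

    # Hypothetical per-query scores (e.g. nDCG@5) on the same query set,
    # paired by query: baseline retriever vs. the ReAlign-trained one.
    baseline = [0.61, 0.55, 0.72, 0.48, 0.66, 0.59, 0.70, 0.52]
    realign  = [0.63, 0.58, 0.71, 0.52, 0.69, 0.60, 0.73, 0.55]

    t_stat, p_t = ttest_rel(realign, baseline)   # paired t-test
    w_stat, p_w = wilcoxon(realign, baseline)    # Wilcoxon signed-rank test
    print(f"paired t-test p={p_t:.4f}, Wilcoxon p={p_w:.4f}")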

  3. Referee: [Method] Method section: The distribution-matching objective used to align the two ranking distributions is described only at a high level; the precise loss (e.g., KL divergence, ranking loss) and its formulation are not given as an equation, which is load-bearing for understanding and reproducing the training procedure.

    Authors: We acknowledge that the loss formulation was presented at an insufficient level of detail. The objective aligns the query-induced and description-induced ranking distributions via Kullback-Leibler divergence. In the revised manuscript, we will add the precise mathematical formulation as an equation in the Method section, defining the distributions as softmax-normalized similarities and specifying the loss computation to enable full understanding and reproduction of the training procedure. revision: yes
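
One plausible formulation consistent with this response, with softmax-normalized similarities and a KL alignment term. The exact equation in the revised paper may differ; the KL direction and the weighting $\lambda$ are our assumptions.

    Given a query $q$, its region-focused description $d$, candidate pages
    $\{p_1, \dots, p_N\}$, retriever similarity $s(\cdot, \cdot)$, and temperature $\tau$:

    $$P_q(i) = \frac{\exp\big(s(q, p_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(s(q, p_j)/\tau\big)},
      \qquad
      P_d(i) = \frac{\exp\big(s(d, p_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(s(d, p_j)/\tau\big)}$$

    $$\mathcal{L}_{\mathrm{align}} = \mathrm{KL}\big(P_q \,\|\, P_d\big)
      = \sum_{i=1}^{N} P_q(i) \log \frac{P_q(i)}{P_d(i)},
      \qquad
      \mathcal{L} = \mathcal{L}_{\mathrm{contrastive}} + \lambda\, \mathcal{L}_{\mathrm{align}}$$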

Circularity Check

0 steps flagged

No circularity: external VLM supervision and empirical validation

Full rationale

The paper proposes ReAlign as a training procedure that uses an independent superior VLM to generate fixed query-aware region descriptions as supervision targets; the retriever is then optimized so its induced ranking distribution matches the distribution from those external descriptions. This is not self-referential because the VLM outputs are generated once and held fixed, independent of the retriever parameters being learned. Performance claims rest on experiments across in-domain and out-of-domain benchmarks rather than any closed mathematical derivation or fitted quantity renamed as a prediction. No self-citations, uniqueness theorems, or ansatzes reduce the central method to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that a stronger VLM can produce accurate, query-grounded region descriptions that serve as reliable supervision; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: A superior VLM can accurately locate query-related regions on a page and generate high-quality query-aware descriptions of the cropped visuals.
    This capability is invoked as the source of the fine-grained supervision signal.

pith-pipeline@v0.9.0 · 5583 in / 1302 out tokens · 92266 ms · 2026-05-10T17:38:12.975346+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.

Reference graph

Works this paper leans on

104 extracted references · 22 canonical work pages · cited by 1 Pith paper · 10 internal anchors
