ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3
The pith
ReAlign trains visual document retrievers by matching the document ranking induced by VLM-generated, query-aware region descriptions to the ranking induced by the original query.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReAlign enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. It employs a superior VLM to identify query-related regions on a page and then generate a query-aware description grounded in the cropped visual regions. The retriever is trained on these region-focused descriptions to align query and document semantics, by encouraging the document ranking distribution induced by the descriptions to match the distribution induced by the original query.
What carries the argument
Reasoning-Guided Alignment (ReAlign), which matches the ranking distribution over documents induced by VLM-produced query-aware region descriptions to the distribution induced by the raw query.
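The paper states this objective only verbally here. Assuming both ranking distributions are softmax-normalized similarity scores over the candidate pool, a minimal NumPy sketch of one plausible loss looks like the following; the KL direction and the temperature `tau` are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def softmax(scores, tau):
    """Temperature-scaled softmax over a vector of similarity scores."""
    z = np.asarray(scores, dtype=float) / tau
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def realign_kl(query_sims, desc_sims, tau=0.05):
    """KL(P_desc || P_q): how far the description-induced document ranking
    distribution is from the query-induced one. Both inputs are similarity
    scores between the query (resp. region description) and each candidate
    document embedding."""
    p_q = softmax(query_sims, tau)
    p_d = softmax(desc_sims, tau)
    return float(np.sum(p_d * (np.log(p_d) - np.log(p_q))))
```

When the two sets of similarities agree, the loss is zero; a description that ranks a different document first yields a strictly positive penalty, which is the training signal the method relies on.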
If this is right
- The method raises retrieval accuracy on both in-domain and out-of-domain visually rich document collections.
- Performance gains hold when the underlying VLM backbone is swapped.
- The retriever learns to focus attention on critical visual cues instead of complex layouts.
- Up to 2 percent relative improvement is observed across standard benchmarks.
Where Pith is reading between the lines
- The same region-description alignment could be applied to other multimodal retrieval tasks where evidence is scattered across images or slides.
- By outsourcing region grounding to an external VLM, the approach may reduce the amount of human-labeled query-page pairs needed for effective training.
- If the alignment step proves robust, it suggests that explicit fine-grained supervision can substitute for some of the data scale required in pure contrastive pre-training of visual retrievers.
Load-bearing premise
A stronger VLM can reliably locate the query-relevant regions on a page and generate descriptions that supply better training signals than contrastive learning on whole-page embeddings.
What would settle it
A controlled experiment in which replacing the VLM region descriptions with random or non-query-aware crops produces equal or higher retrieval scores on the same benchmarks would falsify the value of the alignment step.
Original abstract
Visual document retrieval aims to retrieve a set of document pages relevant to a query from visually rich collections. Existing methods often employ Vision-Language Models (VLMs) to encode queries and visual pages into a shared embedding space, which is then optimized via contrastive training. However, during visual document representation, localized evidence is usually scattered across complex document layouts, making it difficult for retrieval models to capture crucial cues for effective embedding learning. In this paper, we propose Reasoning-Guided Alignment (ReAlign), a method that enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. Specifically, ReAlign employs a superior VLM to identify query-related regions on a page and then generates a query-aware description grounding the cropped visual regions. The retriever is then trained using these region-focused descriptions to align the semantics between queries and visual documents by encouraging the document ranking distribution induced by the region-focused descriptions to match that induced by the original query. Experiments on diverse visually rich document retrieval benchmarks demonstrate that ReAlign consistently improves visual document retrieval performance on both in-domain and out-of-domain datasets, achieving up to 2% relative improvements. Moreover, the advantages of ReAlign generalize across different VLM backbones by guiding models to better focus their attention on critical visual cues for document representation. All code and datasets are available at https://github.com/NEUIR/ReAlign.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ReAlign enhances visual document retrieval by using a superior VLM to identify query-related regions on document pages, generate query-aware descriptions from the cropped regions, and train the retriever to align the document ranking distributions induced by these descriptions with those induced by the original query. Experiments on diverse visually rich document retrieval benchmarks are said to show consistent improvements on both in-domain and out-of-domain datasets, with up to 2% relative gains, and the benefits generalize across VLM backbones.
Significance. If the results hold, ReAlign offers a concrete way to inject fine-grained, reasoning-based supervision into visual document embedding learning, potentially helping models focus on scattered critical cues in complex layouts rather than relying solely on whole-page contrastive training. The public release of code and datasets is a clear strength that supports reproducibility.
Major comments (3)
- [Method] Method section: The central claim depends on the premise that VLM-generated region descriptions constitute reliably superior supervision over direct contrastive training on raw page embeddings, yet the manuscript provides no ablation removing the VLM step, no region-detection accuracy metrics, and no description-quality evaluation to test this assumption.
- [Experiments] Experiments section: The abstract states that consistent improvements were observed up to 2% relative, but supplies no information on baselines, statistical tests, ablation studies, or error bars; without these the support for the central claim cannot be verified.
- [Method] Method section: The distribution-matching objective used to align the two ranking distributions is described only at a high level; the precise loss (e.g., KL divergence, ranking loss) and its formulation are not given as an equation, which is load-bearing for understanding and reproducing the training procedure.
Minor comments (1)
- [Abstract] Abstract: The evaluation metrics (e.g., nDCG@K, Recall@K) underlying the reported improvements are not named.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical validation and methodological precision of the work. We address each major comment point by point below, indicating the revisions we will make.
Point-by-point responses
- Referee: [Method] Method section: The central claim depends on the premise that VLM-generated region descriptions constitute reliably superior supervision over direct contrastive training on raw page embeddings, yet the manuscript provides no ablation removing the VLM step, no region-detection accuracy metrics, and no description-quality evaluation to test this assumption.
Authors: We agree that the manuscript would be strengthened by explicit validation of the VLM-generated supervision signals. The current experiments demonstrate end-to-end gains but do not isolate the VLM component. In the revised version, we will add an ablation study that removes the region identification and query-aware description generation steps, directly comparing ReAlign against standard contrastive training on full-page embeddings. We will also report region-detection accuracy by measuring overlap (e.g., IoU) between VLM-predicted regions and human-annotated query-relevant areas on a sampled subset of documents, along with both automatic metrics and human evaluations of description quality to substantiate the premise. revision: yes
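The overlap measurement the authors commit to is standard intersection-over-union between predicted and annotated boxes. A self-contained sketch of that computation, assuming axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Averaging this score over the sampled subset (or thresholding it, e.g. IoU >= 0.5, for a detection-accuracy rate) would yield the region-detection metric the rebuttal promises.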
- Referee: [Experiments] Experiments section: The abstract states that consistent improvements were observed up to 2% relative, but supplies no information on baselines, statistical tests, ablation studies, or error bars; without these the support for the central claim cannot be verified.
Authors: The referee correctly notes that additional experimental details are needed to fully support the claims. While the manuscript already includes comparisons to multiple baselines across in-domain and out-of-domain benchmarks, we will revise the Experiments section to explicitly enumerate all baselines, report results with error bars from multiple random seeds, include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) for the observed improvements, and expand the ablation studies to cover key design choices. These changes will provide clearer verification of the up to 2% relative gains. revision: yes
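Paired t-tests and Wilcoxon tests (e.g. via `scipy.stats.ttest_rel` / `scipy.stats.wilcoxon`) are the standard route here. As a dependency-free illustration of the same idea, a one-sided paired bootstrap over per-query metric deltas (names and resample count are illustrative):

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: estimated probability that system A's
    mean per-query score does not exceed system B's under resampling of
    the paired per-query deltas."""
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    worse = 0
    for _ in range(n_resamples):
        sample_mean = sum(deltas[rng.randrange(n)] for _ in range(n)) / n
        if sample_mean <= 0:
            worse += 1
    return worse / n_resamples
```

A small p-value indicates the improvement survives query-level resampling, which is the kind of evidence the referee asks for alongside error bars from multiple seeds.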
- Referee: [Method] Method section: The distribution-matching objective used to align the two ranking distributions is described only at a high level; the precise loss (e.g., KL divergence, ranking loss) and its formulation are not given as an equation, which is load-bearing for understanding and reproducing the training procedure.
Authors: We acknowledge that the loss formulation was presented at an insufficient level of detail. The objective aligns the query-induced and description-induced ranking distributions via Kullback-Leibler divergence. In the revised manuscript, we will add the precise mathematical formulation as an equation in the Method section, defining the distributions as softmax-normalized similarities and specifying the loss computation to enable full understanding and reproduction of the training procedure. revision: yes
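For concreteness, one plausible formulation consistent with this description; the temperature $\tau$ and the KL direction are assumptions for illustration, not taken from the paper:

```latex
P_q(d_i) = \frac{\exp\big(\mathrm{sim}(q, d_i)/\tau\big)}
                {\sum_{d_j \in \mathcal{D}} \exp\big(\mathrm{sim}(q, d_j)/\tau\big)},
\qquad
P_r(d_i) \text{ defined analogously for the region description } r,

\mathcal{L}_{\mathrm{align}}
  = \mathrm{KL}\big(P_q \,\|\, P_r\big)
  = \sum_{d_i \in \mathcal{D}} P_q(d_i)\,\log \frac{P_q(d_i)}{P_r(d_i)}.
```

Here $\mathcal{D}$ is the candidate document set and $\mathrm{sim}(\cdot,\cdot)$ the retriever's embedding similarity; the revised manuscript should pin down both the direction of the divergence and the normalization.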
Circularity Check
No circularity: external VLM supervision and empirical validation
Full rationale
The paper proposes ReAlign as a training procedure that uses an independent superior VLM to generate fixed query-aware region descriptions as supervision targets; the retriever is then optimized so its induced ranking distribution matches the distribution from those external descriptions. This is not self-referential because the VLM outputs are generated once and held fixed, independent of the retriever parameters being learned. Performance claims rest on experiments across in-domain and out-of-domain benchmarks rather than any closed mathematical derivation or fitted quantity renamed as a prediction. No self-citations, uniqueness theorems, or ansatzes reduce the central method to its own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: A superior VLM can accurately locate query-related regions on a page and generate high-quality, query-aware descriptions of the cropped visuals.
Forward citations
Cited by 1 Pith paper
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.