Attention Grounded Enhancement for Visual Document Retrieval
Pith reviewed 2026-05-17 20:49 UTC · model grok-4.3
The pith
Cross-modal attention from multimodal models provides effective local supervision for training visual document retrievers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The AGREE framework extracts attention maps from MLLMs that indicate which document regions are attended to for a given query. These attention scores act as local relevance signals. During training, the retriever is optimized using both these local signals and the global document-level relevance label. This dual supervision allows the model to learn not only document-query matches but also the specific content that drives those matches, resulting in improved performance on visual document retrieval benchmarks.
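The dual supervision described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the choice of a KL term for the local signal, and the `weight` parameter are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(weights):
    """Turn non-negative attention weights into a distribution over patches."""
    total = sum(weights)
    return [w / total for w in weights]

def global_loss(doc_scores, positive_index):
    """Cross-entropy over document-level scores (global relevance label)."""
    return -math.log(softmax(doc_scores)[positive_index])

def local_loss(patch_scores, attention_weights, eps=1e-12):
    """KL(target || prediction): the target comes from MLLM attention (assumed),
    the prediction from the retriever's per-patch scores."""
    target = normalize(attention_weights)
    pred = softmax(patch_scores)
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))

def dual_loss(doc_scores, positive_index, patch_scores, attention_weights, weight=0.5):
    """Global term plus a weighted local term; `weight` is a hypothetical knob."""
    return global_loss(doc_scores, positive_index) + weight * local_loss(
        patch_scores, attention_weights
    )
```

The local term is zero when the retriever's patch distribution already matches the attention-derived target, so in this sketch it only penalizes disagreement about where relevance sits on the page.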
What carries the argument
Attention maps from multimodal large language models, used as proxy supervision that guides the retriever toward the relevant document regions.
Load-bearing premise
Attention maps from the multimodal large language model reliably indicate the document regions most relevant to the query.
What would settle it
Evaluate the AGREE-trained retriever on a test set whose relevant regions have been manually annotated. If its region-level scores do not align better with those annotations than the baseline's, the value of the attention supervision would be falsified.
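A test of this kind needs an alignment measure. The sketch below shows two simple candidates, assuming each page is split into patches, the retriever exposes a non-negative score per patch, and annotators supply a set of relevant patch indices. Both metrics and all names are illustrative choices, not taken from the paper.

```python
def mass_in_region(patch_scores, relevant_patches):
    """Share of the total patch score that falls on annotated patches."""
    total = sum(patch_scores)
    if total == 0:
        return 0.0
    return sum(patch_scores[i] for i in relevant_patches) / total

def pointing_hit(patch_scores, relevant_patches):
    """True if the single highest-scoring patch lies in the annotated region."""
    top = max(range(len(patch_scores)), key=patch_scores.__getitem__)
    return top in relevant_patches

def mean_alignment(examples):
    """Average both metrics over (patch_scores, relevant_patches) pairs."""
    n = len(examples)
    mass = sum(mass_in_region(s, r) for s, r in examples) / n
    hits = sum(pointing_hit(s, r) for s, r in examples) / n
    return mass, hits
```

Running `mean_alignment` once on the AGREE-trained retriever and once on the baseline, over the same annotated examples, would give the comparison this section asks for.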
Original abstract
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy implicit information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction to encode holistic information and capture nuanced alignments, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To improve fine-grained relevance modeling, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision to guide the retriever in identifying relevant document regions. Specifically, AGREE extracts attention maps from the MLLM that highlight which document regions are attended to based on the query. These attention scores serve as local, region-level relevance signals. During training, AGREE combines local signals with the global document-level relevance label to jointly optimize the retriever. This dual-level supervision enables the model to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging visual document retrieval benchmark, ViDoRe V2, show that AGREE significantly outperforms the global-supervision-only baseline by 12.82% and 5.03% in terms of average nDCG@1 and nDCG@5. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://github.com/VickiCui/AGREE.
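For readers unfamiliar with the metric behind the headline numbers, here is a minimal sketch of nDCG@k. It assumes the linear-gain variant and computes the ideal ordering from the ranked list itself, so it presumes every relevant document appears in that list. The benchmark's official scorer may differ on both points.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal
```

With a single relevant document, nDCG@1 is 1 only when that document is ranked first, whereas nDCG@5 still gives partial credit for ranks 2 to 5. That difference is one reason the two reported gains need not be similar in size.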
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Attention-Grounded REtriever Enhancement (AGREE) framework for visual document retrieval. It extracts cross-modal attention maps from an MLLM to generate local region-level relevance signals, which are combined with standard global document-level labels to train a screenshot-based retriever that uses fine-grained late interaction. On the ViDoRe V2 benchmark the method reports average gains of 12.82% nDCG@1 and 5.03% nDCG@5 over a global-supervision-only baseline, together with quantitative and qualitative evidence that the dual supervision promotes deeper query-region alignment.
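The "fine-grained late interaction" mentioned in the summary is commonly implemented as ColBERT-style MaxSim scoring. The sketch below assumes that form, with plain Python lists standing in for embedding tensors. Whether the retriever in the paper scores documents exactly this way is not stated in the material reviewed here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, patch_vecs):
    """Sum, over query tokens, of the similarity to the best-matching patch."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)

def best_patches(query_vecs, patch_vecs):
    """Index of the best-matching patch for each query token."""
    return [
        max(range(len(patch_vecs)), key=lambda j: dot(q, patch_vecs[j]))
        for q in query_vecs
    ]
```

Because each query token selects its own patch, the per-token similarities give a natural place to attach region-level supervision such as the attention-derived signals.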
Significance. If the performance lift can be causally attributed to the attention-derived local signals rather than training artifacts or MLLM biases, the work supplies a practical route to fine-grained supervision in visual document retrieval without manual region annotations. The dual-objective formulation directly targets the acknowledged limitation of coarse labels and the public code release aids reproducibility.
major comments (2)
- [Experiments] Experiments section: the headline gains of 12.82% nDCG@1 and 5.03% nDCG@5 are reported without accompanying standard deviations, number of random seeds, or statistical significance tests, so it is impossible to determine whether the improvements exceed training stochasticity.
- [Method] Method section (attention-map extraction and loss combination): the central claim that MLLM cross-modal attention maps supply reliable, query-specific relevance supervision is not supported by any direct comparison to human-annotated relevant regions; without such validation the observed lift could arise from generic saliency or model artifacts rather than semantic alignment.
minor comments (2)
- [Abstract] The abstract states that AGREE 'significantly outperforms' the baseline but does not name the exact baseline architecture or training hyper-parameters used for the comparison.
- [Method] Notation for how attention scores are normalized and injected into the training objective would benefit from an explicit equation in the main text rather than only in the appendix.
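As an illustration of the kind of equation the last comment asks for, one plausible form is given below. It is an assumption made for this review, not the paper's notation: averaging attention over heads and query tokens, the temperature τ, the weight λ, and the KL form of the local term are all hypothetical.

```latex
% Hypothetical notation: a_{h,t,j} is the MLLM attention weight from query token t
% to document patch j in head h; s_j is the retriever's score for patch j.
\begin{align}
\bar{a}_j &= \frac{1}{HT}\sum_{h=1}^{H}\sum_{t=1}^{T} a_{h,t,j}, &
p_j &= \frac{\bar{a}_j}{\sum_{j'=1}^{N}\bar{a}_{j'}}, \\
r_j &= \frac{\exp(s_j/\tau)}{\sum_{j'=1}^{N}\exp(s_{j'}/\tau)}, &
\mathcal{L} &= \mathcal{L}_{\text{global}} + \lambda \sum_{j=1}^{N} p_j \log\frac{p_j}{r_j}.
\end{align}
```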
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will update the paper accordingly.
- Referee: [Experiments] Experiments section: the headline gains of 12.82% nDCG@1 and 5.03% nDCG@5 are reported without accompanying standard deviations, number of random seeds, or statistical significance tests, so it is impossible to determine whether the improvements exceed training stochasticity.
  Authors: We agree that the current reporting leaves the gains vulnerable to questions of stochasticity. In the revision we will rerun all experiments using five random seeds, report mean and standard deviation for nDCG@1 and nDCG@5, and include paired t-test p-values against the global-supervision baseline to establish statistical significance.
  Revision: yes
- Referee: [Method] Method section (attention-map extraction and loss combination): the central claim that MLLM cross-modal attention maps supply reliable, query-specific relevance supervision is not supported by any direct comparison to human-annotated relevant regions; without such validation the observed lift could arise from generic saliency or model artifacts rather than semantic alignment.
  Authors: We acknowledge that a direct human-region annotation study would constitute stronger evidence. The manuscript already supplies indirect support via quantitative alignment metrics (improved query-term to region matching scores) and qualitative visualizations that distinguish semantic focus from generic saliency. We will expand the discussion section to foreground these existing analyses, add an explicit limitations paragraph on the lack of human validation, and, if resources permit, include a small-scale human study in the camera-ready version.
  Revision: partial
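The seed-level reporting promised in the first response can be sketched as follows, using only the standard library. The function computes a paired t statistic over per-seed scores. The constant is the standard two-sided 5% critical value for 4 degrees of freedom, which is the case for five seeds. No scores from the paper are used or implied.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 4 degrees of freedom (five seeds).
T_CRIT_DF4 = 2.776

def paired_t_statistic(treatment, baseline):
    """Return (t, degrees_of_freedom) for paired per-seed scores."""
    diffs = [t - b for t, b in zip(treatment, baseline)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)
```

With five seeds, an absolute t value above `T_CRIT_DF4` would indicate a difference significant at the 5% level under the usual normality assumption for the paired differences.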
Circularity Check
No significant circularity; empirical gains rest on external MLLM attention signals
The paper's core contribution is an empirical training procedure that augments global relevance labels with region-level signals extracted from an off-the-shelf MLLM's cross-modal attention maps. No equations, self-citations, or fitted parameters are shown that reduce the claimed nDCG improvements to a tautology or to the input labels themselves. The method is self-contained against the ViDoRe V2 benchmark once the (external) assumption about attention-map reliability is granted; the derivation chain does not collapse by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Cross-modal attention maps from MLLMs accurately identify document regions that support query relevance.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision... combines local signals with the global document-level relevance label
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Experiments on the challenging visual document retrieval benchmark, ViDoRe V2
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
  ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.