pith. sign in

arxiv: 2511.13415 · v2 · submitted 2025-11-17 · 💻 cs.IR · cs.CL· cs.CV

Attention Grounded Enhancement for Visual Document Retrieval

Pith reviewed 2026-05-17 20:49 UTC · model grok-4.3

classification 💻 cs.IR cs.CLcs.CV
keywords visual document retrievalattention grounded enhancementmultimodal large language modelsproxy supervisionfine-grained relevanceregion-level signalsretrieval performance
0
0 comments X

The pith

Cross-modal attention from multimodal models provides effective local supervision for training visual document retrievers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to enhance visual document retrieval by using attention maps extracted from multimodal large language models as additional training signals. These maps highlight relevant regions within documents based on the query, which are then used alongside global relevance labels to train the retriever. This addresses the limitation of relying only on coarse labels that do not indicate which parts of the document support the match. As a result, the retriever can better capture nuanced and implicit semantic connections instead of surface-level cues. A sympathetic reader would care because this leads to more accurate retrieval for complex, non-extractive information needs in documents.

Core claim

The AGREE framework extracts attention maps from MLLMs that indicate which document regions are attended to for a given query. These attention scores act as local relevance signals. During training, the retriever is optimized using both these local signals and the global document-level relevance label. This dual supervision allows the model to learn not only document-query matches but also the specific content that drives those matches, resulting in improved performance on visual document retrieval benchmarks.

What carries the argument

The attention maps from multimodal large language models used as proxy supervision to guide identification of relevant document regions in the retriever.

Load-bearing premise

Attention maps from the multimodal large language model reliably indicate the document regions most relevant to the query.

What would settle it

Running the AGREE-trained retriever on a test set where relevant regions have been manually annotated and finding that it does not align better with those annotations than the baseline would falsify the value of the attention supervision.

Figures

Figures reproduced from arXiv: 2511.13415 by Junfeng Ma, Keping Bi, Meiguang Jin, Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu.

Figure 1
Figure 1. Figure 1: The similarity map of ColQwen2.5 (left), versus the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the AGREE training framework. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Human annotation (left) and top-3% high attention [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) Results on ViDoRe V1 using PaliGemma and [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Retrieval performance on ViDoRe V2 related to the [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Coverage of human-annotated matching areas by [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Late interaction heatmaps of ColQwen2.5 (left) and AGREEQwen2.5 (middle). The right panels zoom in on key regions. [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
read the original abstract

Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy implicit information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction to encode holistic information and capture nuanced alignments, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries.To improve fine-grained relevance modeling, we propose a Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision to guide the retriever in identifying relevant document regions. Specifically, AGREE extracts attention maps from the MLLM that highlight which document regions are attended to based on the query. These attention scores serve as local, region-level relevance signals. During training, AGREE combines local signals with the global document-level relevance label to jointly optimize the retriever. This dual-level supervision enables the model to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging visual document retrieval benchmark, ViDoRe V2, show that AGREE significantly outperforms the global-supervision-only baseline by 12.82\% and 5.03\% in terms of average nDCG@1 and nDCG@5. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://github.com/VickiCui/AGREE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Attention-Grounded REtriever Enhancement (AGREE) framework for visual document retrieval. It extracts cross-modal attention maps from an MLLM to generate local region-level relevance signals, which are combined with standard global document-level labels to train a screenshot-based retriever that uses fine-grained late interaction. On the ViDoRe V2 benchmark the method reports average gains of 12.82% nDCG@1 and 5.03% nDCG@5 over a global-supervision-only baseline, together with quantitative and qualitative evidence that the dual supervision promotes deeper query-region alignment.

Significance. If the performance lift can be causally attributed to the attention-derived local signals rather than training artifacts or MLLM biases, the work supplies a practical route to fine-grained supervision in visual document retrieval without manual region annotations. The dual-objective formulation directly targets the acknowledged limitation of coarse labels and the public code release aids reproducibility.

major comments (2)
  1. [Experiments] Experiments section: the headline gains of 12.82% nDCG@1 and 5.03% nDCG@5 are reported without accompanying standard deviations, number of random seeds, or statistical significance tests, so it is impossible to determine whether the improvements exceed training stochasticity.
  2. [Method] Method section (attention-map extraction and loss combination): the central claim that MLLM cross-modal attention maps supply reliable, query-specific relevance supervision is not supported by any direct comparison to human-annotated relevant regions; without such validation the observed lift could arise from generic saliency or model artifacts rather than semantic alignment.
minor comments (2)
  1. [Abstract] The abstract states that AGREE 'significantly outperforms' the baseline but does not name the exact baseline architecture or training hyper-parameters used for the comparison.
  2. [Method] Notation for how attention scores are normalized and injected into the training objective would benefit from an explicit equation in the main text rather than only in the appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will update the paper accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline gains of 12.82% nDCG@1 and 5.03% nDCG@5 are reported without accompanying standard deviations, number of random seeds, or statistical significance tests, so it is impossible to determine whether the improvements exceed training stochasticity.

    Authors: We agree that the current reporting leaves the gains vulnerable to questions of stochasticity. In the revision we will rerun all experiments using five random seeds, report mean and standard deviation for nDCG@1 and nDCG@5, and include paired t-test p-values against the global-supervision baseline to establish statistical significance. revision: yes

  2. Referee: [Method] Method section (attention-map extraction and loss combination): the central claim that MLLM cross-modal attention maps supply reliable, query-specific relevance supervision is not supported by any direct comparison to human-annotated relevant regions; without such validation the observed lift could arise from generic saliency or model artifacts rather than semantic alignment.

    Authors: We acknowledge that a direct human-region annotation study would constitute stronger evidence. The manuscript already supplies indirect support via quantitative alignment metrics (improved query-term to region matching scores) and qualitative visualizations that distinguish semantic focus from generic saliency. We will expand the discussion section to foreground these existing analyses, add an explicit limitations paragraph on the lack of human validation, and, if resources permit, include a small-scale human study in the camera-ready version. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external MLLM attention signals

full rationale

The paper's core contribution is an empirical training procedure that augments global relevance labels with region-level signals extracted from an off-the-shelf MLLM's cross-modal attention maps. No equations, self-citations, or fitted parameters are shown that reduce the claimed nDCG improvements to a tautology or to the input labels themselves. The method is self-contained against the ViDoRe V2 benchmark once the (external) assumption about attention-map reliability is granted; the derivation chain does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that MLLM attention maps provide valid region-level relevance signals; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption Cross-modal attention maps from MLLMs accurately identify document regions that support query relevance.
    This assumption is invoked when the attention scores are treated as proxy supervision without further validation described in the abstract.

pith-pipeline@v0.9.0 · 5620 in / 1075 out tokens · 29483 ms · 2026-05-17T20:49:33.876903+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    cs.IR 2026-04 unverdicted novelty 6.0

    ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Hassan Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Singh Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio C’esar Teodoro Mendes, Weizhu Chen, Vishrav Chaud- hary, Parul Chopra, Allison Del Giorno, Gustavo de Rosa, Matthew ...

  2. [2]

    Mothilal Asokan, Kebin Wu, and Fatima Albreiki. 2025. FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs. InProceedings of the Computer Vision and Pattern Recognition Conference. 14495–14504

  3. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Rep...

  4. [5]

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschan- nen, Emanuele Bugliarello, et al. 2024. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726(2024)

  5. [6]

    Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Chris- tos Kaplanis, Alexey A Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, et al. 2024. Improving fine-grained understanding in image-text pre- training.arXiv preprint arXiv:2401.09865(2024)

  6. [7]

    Kang Chen and Xiangqian Wu. 2024. VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 27218–27227

  7. [8]

    Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. 2024. M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding.arXiv preprint arXiv:2411.04952(2024)

  8. [9]

    Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O’Brien, and Vasu Sharma. 2025. Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation.arXiv preprint arXiv:2505.21549(2025)

  9. [10]

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2023. Vision transformers need registers.arXiv preprint arXiv:2309.16588(2023)

  10. [11]

    Jianfeng Dong, Minsong Zhang, Zheng Zhang, Xianke Chen, Daizong Liu, Xiaoye Qu, Xun Wang, and Baolong Liu. 2023. Dual learning with dynamic knowledge distillation for partially relevant video retrieval. InProceedings of the IEEE/CVF International Conference on Computer Vision. 11302–11312

  11. [12]

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. Colpali: Efficient document retrieval with vision language models. InThe Thirteenth International Conference on Learning Representations

  12. [13]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

  13. [14]

    Gautier Izacard and Edouard Grave. 2020. Distilling knowledge from reader to retriever for question answering.arXiv preprint arXiv:2012.04584(2020)

  14. [15]

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2024. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160(2024)

  15. [16]

    Heegon Jin, Seonil Son, Jemin Park, Youngseok Kim, Hyungjong Noh, and Yeon- soo Lee. 2024. Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation. InProceedings of the 2024 Joint In- ternational Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 722–732

  16. [17]

    Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Wei Wei, Huiwen Zhao, Zhiwu Lu, et al. 2024. Fineclip: Self-distilled region-based clip for better fine-grained understanding.Advances in Neural Information Processing Systems37 (2024), 27896–27918

  17. [18]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering.. InEMNLP (1). 6769–6781

  18. [19]

    Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. InProceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48

  19. [20]

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning.Advances in neural information processing systems33 (2020), 18661– 18673

  20. [21]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems33 (2020), 9459–9474

  21. [22]

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. 2024. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models.arXiv preprint arXiv:2403.00231(2024)

  22. [23]

    Zizhong Li, Haopeng Zhang, and Jiawei Zhang. 2024. Intermediate distillation: Data-efficient distillation from black-box llms for information retrieval.arXiv preprint arXiv:2406.12169(2024)

  23. [24]

    Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP- 2021). 163–173

  24. [25]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT: Improved reasoning, OCR, and world knowl- edge. https://llava-vl.github.io/blog/2024-01-30-llava-next/

  25. [26]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruc- tion Tuning

  26. [27]

    Weihao Liu, Fangyu Lei, Tongxu Luo, Jiahe Lei, Shizhu He, Jun Zhao, and Kang Liu

  27. [28]

    Mmhqa-icl: Multimodal in-context learning for hybrid question answering over text, tables and images.arXiv preprint arXiv:2309.04790(2023)

  28. [29]

    Wenhao Lu, Jian Jiao, and Ruofei Zhang. 2020. Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval. InProceedings of the 29th ACM International Conference on Information & Knowledge Management. 2645–2652

  29. [30]

    Haohao Luo, Ying Shen, and Yang Deng. 2023. Unifying text, tables, and images for multimodal question answering. Association for Computational Linguistics

  30. [31]

    Wentao Ma, Qingchao Chen, Tongqing Zhou, Shan Zhao, and Zhiping Cai. 2023. Using multimodal contrastive knowledge distillation for video-text retrieval. IEEE Transactions on Circuits and Systems for Video Technology33, 10 (2023), 5486–5497

  31. [32]

    Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin

  32. [33]

    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

    Unifying Multimodal Retrieval via Document Screenshot Embedding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 6492–6505

  33. [34]

    Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv:2505.17166 [cs.IR] https://arxiv. org/abs/2505.17166

  34. [35]

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. 2022. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1697–1706

  35. [36]

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 2200–2209

  36. [37]

    Jamshed Memon, Maira Sami, Rizwan Ahmed Khan, and Mueen Uddin. 2020. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR).IEEE access8 (2020), 142642–142668

  37. [38]

    Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2021. Thinking fast and slow: Efficient text-to-visual retrieval with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9826–9836

  38. [39]

    pdfminer. 2014. pdfminer.six. https://github.com/pdfminer/pdfminer.six

  39. [40]

    pymupdf. 2012. PyMuPDF. https://github.com/pymupdf/PyMuPDF. Attention Grounded Enhancement for Visual Document Retrieval Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

  40. [41]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

  41. [42]

    Jun Rao, Liang Ding, Shuhan Qi, Meng Fang, Yang Liu, Li Shen, and Dacheng Tao

  42. [43]

    Dynamic contrastive distillation for image-text retrieval.IEEE Transactions on Multimedia25 (2023), 8383–8395

  43. [44]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084(2019)

  44. [45]

    Stephen Robertson, Hugo Zaragoza, et al . 2009. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends®in Information Retrieval 3, 4 (2009), 333–389

  45. [46]

    Bharat Bhusan Sau, Soumya Roy, Vinay P Namboodiri, and Raghu Sesha Iyengar

  46. [47]

    Deep Knowledge Distillation using Trainable Dense Attention.. InBMVC. 72

  47. [48]

    Sungho Shin, Joosoon Lee, Junseok Lee, Yeonguk Yu, and Kyoobin Lee. 2022. Teaching where to look: Attention similarity knowledge distillation for low resolution face recognition. InEuropean Conference on Computer Vision. Springer, 631–647

  48. [49]

    Ray Smith. 2007. An overview of the Tesseract OCR engine. InNinth international conference on document analysis and recognition (ICDAR 2007), Vol. 2. IEEE, 629– 633

  49. [50]

    Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval.Journal of documentation28, 1 (1972), 11–21

  50. [51]

    Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. 2023. Slidevqa: A dataset for document visual question an- swering on multiple images. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 13636–13645

  51. [52]

    Nomic Team. 2025. Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval. https://nomic.ai/blog/posts/nomic- embed-multimodal

  52. [53]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533(2022)

  53. [54]

    Pengfei Wang, Guohai Xu, Weinong Wang, Junjie Yang, Jie Lou, and Yunhua Xue. 2025. Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis.arXiv preprint arXiv:2505.10541 (2025)

  54. [55]

    Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. 2024. Uniir: Training and benchmarking universal multimodal information retrievers. InEuropean Conference on Computer Vision. Springer, 387–404

  55. [56]

    Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. 2025. FG-CLIP: Fine-Grained Visual and Textual Alignment.arXiv preprint arXiv:2505.05071(2025)

  56. [57]

    Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, and Even Oldridge. 2025. Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model. arXiv:2507.05513 [cs.CV] https://arxiv.org/abs/2507.05513

  57. [58]

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiao- dan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. Filip: Fine-grained interactive language-image pre-training.arXiv preprint arXiv:2111.07783(2021)

  58. [59]

    Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2022. Retrieval- augmented multimodal language modeling.arXiv preprint arXiv:2211.12561 (2022)

  59. [60]

    Bowen Yu, Cheng Fu, Haiyang Yu, Fei Huang, and Yongbin Li. 2023. Unified Language Representation for Question Answering over Text, Tables, and Images. InFindings of the Association for Computational Linguistics: ACL 2023. 4756–4765

  60. [61]

    Sergey Zagoruyko and Nikos Komodakis. 2016. Paying more attention to atten- tion: Improving the performance of convolutional neural networks via attention transfer.arXiv preprint arXiv:1612.03928(2016)

  61. [62]

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sig- moid loss for language image pre-training. InProceedings of the IEEE/CVF inter- national conference on computer vision. 11975–11986

  62. [63]

    Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. 2025. Mllms know where to look: Training-free perception of small visual details with multimodal llms.arXiv preprint arXiv:2502.17422(2025)

  63. [64]

    Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Jianbing Shen, Guodong Long, Can Xu, and Daxin Jiang. 2024. Fine-grained distillation for long document retrieval. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19732–19740

  64. [65]

    Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question an- swering benchmark on a hybrid of tabular and textual content in finance.arXiv preprint arXiv:2105.07624(2021)

  65. [66]

    Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon

  66. [67]

    Promptreps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval.arXiv preprint arXiv:2404.18424 (2024)