Attention Grounded Enhancement for Visual Document Retrieval

Junfeng Ma; Keping Bi; Meiguang Jin; Wanqing Cui; Wei Huang; Yazhi Guo; Yibo Hu

arXiv: 2511.13415 · v2 · submitted 2025-11-17 · cs.IR · cs.CL · cs.CV

Pith reviewed 2026-05-17 20:49 UTC · model grok-4.3

keywords: visual document retrieval · attention grounded enhancement · multimodal large language models · proxy supervision · fine-grained relevance · region-level signals · retrieval performance

The pith

Cross-modal attention from multimodal models provides effective local supervision for training visual document retrievers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to enhance visual document retrieval by using attention maps extracted from multimodal large language models as additional training signals. These maps highlight relevant regions within documents based on the query, which are then used alongside global relevance labels to train the retriever. This addresses the limitation of relying only on coarse labels that do not indicate which parts of the document support the match. As a result, the retriever can better capture nuanced and implicit semantic connections instead of surface-level cues. A sympathetic reader would care because this leads to more accurate retrieval for complex, non-extractive information needs in documents.

Core claim

The AGREE framework extracts attention maps from MLLMs that indicate which document regions are attended to for a given query. These attention scores act as local relevance signals. During training, the retriever is optimized using both these local signals and the global document-level relevance label. This dual supervision allows the model to learn not only document-query matches but also the specific content that drives those matches, resulting in improved performance on visual document retrieval benchmarks.
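The dual supervision described above can be sketched as a weighted sum of two terms. This is a minimal illustration, not the paper's actual objective, which this summary does not spell out. The global term is assumed to be a softmax cross-entropy over candidate documents, the local term is assumed to be a KL divergence between normalized MLLM attention and the retriever's softmaxed region scores, and `weight` and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def global_loss(doc_scores, positive_index):
    """Cross-entropy of the relevant document among all candidates (assumed form)."""
    return -math.log(softmax(doc_scores)[positive_index])


def local_loss(region_scores, attention):
    """KL(attention || retriever) over the regions of one document (assumed form)."""
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention must contain positive mass")
    target = [a / total for a in attention]   # attention as a distribution
    predicted = softmax(region_scores)        # retriever's region distribution
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


def dual_loss(doc_scores, positive_index, region_scores, attention, weight=0.5):
    """Global document-level term plus a weighted local region-level term."""
    return global_loss(doc_scores, positive_index) + weight * local_loss(
        region_scores, attention
    )
```

With `weight=0` the sketch reduces to a global-supervision-only objective, which is the kind of baseline the paper compares against. The local term is zero exactly when the retriever's region distribution matches the attention distribution.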

What carries the argument

Attention maps extracted from multimodal large language models serve as proxy supervision that guides the retriever toward the relevant document regions.

Load-bearing premise

Attention maps from the multimodal large language model reliably indicate the document regions most relevant to the query.

What would settle it

Running the AGREE-trained retriever on a test set where relevant regions have been manually annotated and finding that it does not align better with those annotations than the baseline would falsify the value of the attention supervision.
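One way to run such a check is to measure how much of a human-annotated area is covered by the highest-scoring patches. The sketch below assumes the page is split into indexed patches, that per-patch scores are available as a flat list, and that the annotation is a set of patch indices. The function names are hypothetical, and the 3% cutoff seen in the Figure 3 caption is only one possible choice of `fraction`.

```python
def top_fraction_indices(scores, fraction):
    """Indices of the highest-scoring patches (count rounded to nearest, at least one)."""
    keep = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:keep])


def coverage(selected, annotated):
    """Share of human-annotated patches that fall inside the selected set."""
    if not annotated:
        return 0.0
    return len(selected & annotated) / len(annotated)


# Hypothetical five-patch page where patches 1 and 2 were marked by an annotator.
patch_scores = [0.01, 0.40, 0.30, 0.02, 0.27]
annotated_patches = {1, 2}
selected = top_fraction_indices(patch_scores, 0.4)
print(coverage(selected, annotated_patches))
```

Computing the same coverage for the baseline and for the AGREE-trained retriever on identical pages would give a direct comparison of how well each aligns with the annotations.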

Figures

Figures reproduced from arXiv: 2511.13415 by Junfeng Ma, Keping Bi, Meiguang Jin, Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu.

**Figure 2.** Overview of the AGREE training framework.

**Figure 3.** Human annotation (left) and top-3% high attention

**Figure 4.** (a) Results on ViDoRe V1 using PaliGemma and

**Figure 6.** Retrieval performance on ViDoRe V2 related to the

**Figure 7.** Coverage of human-annotated matching areas by

**Figure 8.** Late interaction heatmaps of ColQwen2.5 (left) and AGREEQwen2.5 (middle). The right panels zoom in on key regions.

Original abstract

Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy implicit information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction to encode holistic information and capture nuanced alignments, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To improve fine-grained relevance modeling, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision to guide the retriever in identifying relevant document regions. Specifically, AGREE extracts attention maps from the MLLM that highlight which document regions are attended to based on the query. These attention scores serve as local, region-level relevance signals. During training, AGREE combines local signals with the global document-level relevance label to jointly optimize the retriever. This dual-level supervision enables the model to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging visual document retrieval benchmark, ViDoRe V2, show that AGREE significantly outperforms the global-supervision-only baseline by 12.82% and 5.03% in terms of average nDCG@1 and nDCG@5. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://github.com/VickiCui/AGREE.
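For readers unfamiliar with the metric behind the reported gains, the following is a minimal sketch of nDCG@k for a single query. It assumes the linear-gain form of DCG and that the ranked list contains every judged document for the query. The benchmark's official evaluation code may differ in both respects, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG@k divided by the DCG@k of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document was judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# Relevance labels in the order the retriever returned the pages (hypothetical).
print(ndcg_at_k([0, 1, 0], 1))  # relevant page missed at rank 1
print(ndcg_at_k([0, 1, 0], 5))  # partial credit for placing it at rank 2
```

nDCG@1 only rewards placing a relevant page first, while nDCG@5 gives discounted credit for relevant pages anywhere in the top five. The averages in the abstract are taken over queries and datasets.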

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AGREE adds MLLM attention maps as local supervision signals on top of global labels for visual document retrieval and reports clear nDCG gains on ViDoRe V2, but the maps' reliability as unbiased relevance proxies is the main untested piece.

The main thing here is that AGREE pulls cross-modal attention from an MLLM and uses those maps as region-level supervision to train the retriever alongside the standard global relevance label. That dual signal is the concrete addition, and the abstract shows it delivers 12.82% better nDCG@1 and 5.03% better nDCG@5 over the global-only baseline on ViDoRe V2. They also mention quantitative and qualitative checks that the model aligns better with query terms instead of surface cues. The code release is helpful for anyone who wants to inspect the implementation directly.

What works is the straightforward framing: extract the attention, treat the scores as local targets, and optimize jointly. It gives a practical handle on fine-grained modeling without inventing new architectures from scratch.

The soft spot is the assumption that the MLLM attention maps actually point to the regions that matter for relevance. If the maps mostly reflect tokenization artifacts, training biases, or generic saliency rather than the implicit semantic matches the queries need, then the extra signal may not be adding much beyond what the global label already provides. The abstract does not detail controls for that or statistical significance, so the full paper has to show those checks or the gains stay provisional.

This is aimed at people already working on screenshot-based document retrieval and late-interaction models. A reader who cares about multimodal alignment or proxy supervision techniques would get usable ideas and numbers to compare against.

I would send it to peer review. The empirical results are there to discuss and the method is simple enough that referees can focus on whether the supervision actually holds up.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Attention-Grounded REtriever Enhancement (AGREE) framework for visual document retrieval. It extracts cross-modal attention maps from an MLLM to generate local region-level relevance signals, which are combined with standard global document-level labels to train a screenshot-based retriever that uses fine-grained late interaction. On the ViDoRe V2 benchmark the method reports average gains of 12.82% nDCG@1 and 5.03% nDCG@5 over a global-supervision-only baseline, together with quantitative and qualitative evidence that the dual supervision promotes deeper query-region alignment.

Significance. If the performance lift can be causally attributed to the attention-derived local signals rather than training artifacts or MLLM biases, the work supplies a practical route to fine-grained supervision in visual document retrieval without manual region annotations. The dual-objective formulation directly targets the acknowledged limitation of coarse labels and the public code release aids reproducibility.
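The late-interaction scoring mentioned in the summary can be illustrated with the ColBERT-style MaxSim operator, in which each query token is matched to its best document patch and the per-token maxima are summed. This is a sketch under that assumption. The retriever in the paper may differ in normalization or other details, and the function names and toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, patch_vecs):
    """Sum over query tokens of the best-matching patch similarity."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)


def maxsim_heatmap(query_vec, patch_vecs):
    """Per-patch similarity for one query token, the raw values behind a heatmap."""
    return [dot(query_vec, p) for p in patch_vecs]


# Two query-token embeddings and two patch embeddings (toy values).
query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim_score(query, patches))
print(maxsim_heatmap(query[1], patches))
```

Because the document score decomposes into per-token, per-patch similarities, region-level supervision has a natural place to attach: the same similarities that are visualized as heatmaps can be compared against an external relevance signal.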

major comments (2)

[Experiments] Experiments section: the headline gains of 12.82% nDCG@1 and 5.03% nDCG@5 are reported without accompanying standard deviations, number of random seeds, or statistical significance tests, so it is impossible to determine whether the improvements exceed training stochasticity.
[Method] Method section (attention-map extraction and loss combination): the central claim that MLLM cross-modal attention maps supply reliable, query-specific relevance supervision is not supported by any direct comparison to human-annotated relevant regions; without such validation the observed lift could arise from generic saliency or model artifacts rather than semantic alignment.

minor comments (2)

[Abstract] The abstract states that AGREE 'significantly outperforms' the baseline but does not name the exact baseline architecture or training hyper-parameters used for the comparison.
[Method] Notation for how attention scores are normalized and injected into the training objective would benefit from an explicit equation in the main text rather than only in the appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will update the paper accordingly.

Referee: [Experiments] Experiments section: the headline gains of 12.82% nDCG@1 and 5.03% nDCG@5 are reported without accompanying standard deviations, number of random seeds, or statistical significance tests, so it is impossible to determine whether the improvements exceed training stochasticity.

Authors: We agree that the current reporting leaves the gains vulnerable to questions of stochasticity. In the revision we will rerun all experiments using five random seeds, report mean and standard deviation for nDCG@1 and nDCG@5, and include paired t-test p-values against the global-supervision baseline to establish statistical significance. Revision: yes.

Referee: [Method] Method section (attention-map extraction and loss combination): the central claim that MLLM cross-modal attention maps supply reliable, query-specific relevance supervision is not supported by any direct comparison to human-annotated relevant regions; without such validation the observed lift could arise from generic saliency or model artifacts rather than semantic alignment.

Authors: We acknowledge that a direct human-region annotation study would constitute stronger evidence. The manuscript already supplies indirect support via quantitative alignment metrics (improved query-term to region matching scores) and qualitative visualizations that distinguish semantic focus from generic saliency. We will expand the discussion section to foreground these existing analyses, add an explicit limitations paragraph on the lack of human validation, and, if resources permit, include a small-scale human study in the camera-ready version. Revision: partial.
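The seed-level reporting promised in the first response could look like the sketch below, which computes the mean and sample standard deviation per system and the paired t statistic over per-seed differences. The scores are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(treatment, baseline):
    """t statistic of the per-seed differences, with len(treatment) - 1 degrees of freedom."""
    diffs = [t - b for t, b in zip(treatment, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all per-seed differences are identical")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder nDCG@5 values for five seeds; NOT taken from the paper.
agree_runs = [0.61, 0.63, 0.60, 0.62, 0.64]
baseline_runs = [0.58, 0.59, 0.57, 0.60, 0.59]
print(summarize(agree_runs), summarize(baseline_runs))
print(paired_t_statistic(agree_runs, baseline_runs))
```

The p-value follows from the Student t distribution with four degrees of freedom when five seeds are used. SciPy's `scipy.stats.ttest_rel` returns both the statistic and the p-value if that dependency is available.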

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external MLLM attention signals

The paper's core contribution is an empirical training procedure that augments global relevance labels with region-level signals extracted from an off-the-shelf MLLM's cross-modal attention maps. No equations, self-citations, or fitted parameters are shown that reduce the claimed nDCG improvements to a tautology or to the input labels themselves. The method is self-contained against the ViDoRe V2 benchmark once the (external) assumption about attention-map reliability is granted; the derivation chain does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MLLM attention maps provide valid region-level relevance signals; no free parameters or invented entities are introduced in the abstract description.

axioms (1)

domain assumption: Cross-modal attention maps from MLLMs accurately identify document regions that support query relevance.
This assumption is invoked when the attention scores are treated as proxy supervision without further validation described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision... combines local signals with the global document-level relevance label"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Experiments on the challenging visual document retrieval benchmark, ViDoRe V2"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
cs.IR · 2026-04 · unverdicted · novelty 6.0

ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.


work page 2023

[11] [12]

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. Colpali: Efficient document retrieval with vision language models. InThe Thirteenth International Conference on Learning Representations

work page 2024

[12] [13]

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[13] [14]

Gautier Izacard and Edouard Grave. 2020. Distilling knowledge from reader to retriever for question answering.arXiv preprint arXiv:2012.04584(2020)

work page arXiv 2020

[14] [15]

Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2024. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160(2024)

work page internal anchor Pith review arXiv 2024

[15] [16]

Heegon Jin, Seonil Son, Jemin Park, Youngseok Kim, Hyungjong Noh, and Yeon- soo Lee. 2024. Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation. InProceedings of the 2024 Joint In- ternational Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 722–732

work page 2024

[16] [17]

Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Wei Wei, Huiwen Zhao, Zhiwu Lu, et al. 2024. Fineclip: Self-distilled region-based clip for better fine-grained understanding.Advances in Neural Information Processing Systems37 (2024), 27896–27918

work page 2024

[17] [18]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering.. InEMNLP (1). 6769–6781

work page 2020

[18] [19]

Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. InProceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48

work page 2020

[19] [20]

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning.Advances in neural information processing systems33 (2020), 18661– 18673

work page 2020

[20] [21]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems33 (2020), 9459–9474

work page 2020

[21] [22]

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. 2024. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models.arXiv preprint arXiv:2403.00231(2024)

work page arXiv 2024

[22] [23]

Zizhong Li, Haopeng Zhang, and Jiawei Zhang. 2024. Intermediate distillation: Data-efficient distillation from black-box llms for information retrieval.arXiv preprint arXiv:2406.12169(2024)

work page arXiv 2024

[23] [24]

Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP- 2021). 163–173

work page 2021

[24] [25]

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT: Improved reasoning, OCR, and world knowl- edge. https://llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024

[25] [26]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruc- tion Tuning

work page 2023

[26] [27]

Weihao Liu, Fangyu Lei, Tongxu Luo, Jiahe Lei, Shizhu He, Jun Zhao, and Kang Liu

work page

[27] [28]

Mmhqa-icl: Multimodal in-context learning for hybrid question answering over text, tables and images.arXiv preprint arXiv:2309.04790(2023)

work page arXiv 2023

[28] [29]

Wenhao Lu, Jian Jiao, and Ruofei Zhang. 2020. Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval. InProceedings of the 29th ACM International Conference on Information & Knowledge Management. 2645–2652

work page 2020

[29] [30]

Haohao Luo, Ying Shen, and Yang Deng. 2023. Unifying text, tables, and images for multimodal question answering. Association for Computational Linguistics

work page 2023

[30] [31]

Wentao Ma, Qingchao Chen, Tongqing Zhou, Shan Zhao, and Zhiping Cai. 2023. Using multimodal contrastive knowledge distillation for video-text retrieval. IEEE Transactions on Circuits and Systems for Video Technology33, 10 (2023), 5486–5497

work page 2023

[31] [32]

Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin

work page

[32] [33]

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Unifying Multimodal Retrieval via Document Screenshot Embedding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 6492–6505

work page 2024

[33] [34]

Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv:2505.17166 [cs.IR] https://arxiv. org/abs/2505.17166

work page arXiv 2025

[34] [35]

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. 2022. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1697–1706

work page 2022

[35] [36]

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 2200–2209

work page 2021

[36] [37]

Jamshed Memon, Maira Sami, Rizwan Ahmed Khan, and Mueen Uddin. 2020. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR).IEEE access8 (2020), 142642–142668

work page 2020

[37] [38]

Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2021. Thinking fast and slow: Efficient text-to-visual retrieval with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9826–9836

work page 2021

[38] [39]

pdfminer. 2014. pdfminer.six. https://github.com/pdfminer/pdfminer.six

work page 2014

[39] [40]

pymupdf. 2012. PyMuPDF. https://github.com/pymupdf/PyMuPDF. Attention Grounded Enhancement for Visual Document Retrieval Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

work page 2012

[40] [41]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

work page 2021

[41] [42]

Jun Rao, Liang Ding, Shuhan Qi, Meng Fang, Yang Liu, Li Shen, and Dacheng Tao

work page

[42] [43]

Dynamic contrastive distillation for image-text retrieval.IEEE Transactions on Multimedia25 (2023), 8383–8395

work page 2023

[43] [44]

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084(2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[44] [45]

Stephen Robertson, Hugo Zaragoza, et al . 2009. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends®in Information Retrieval 3, 4 (2009), 333–389

work page 2009

[45] [46]

Bharat Bhusan Sau, Soumya Roy, Vinay P Namboodiri, and Raghu Sesha Iyengar

work page

[46] [47]

Deep Knowledge Distillation using Trainable Dense Attention.. InBMVC. 72

work page

[47] [48]

Sungho Shin, Joosoon Lee, Junseok Lee, Yeonguk Yu, and Kyoobin Lee. 2022. Teaching where to look: Attention similarity knowledge distillation for low resolution face recognition. InEuropean Conference on Computer Vision. Springer, 631–647

work page 2022

[48] [49]

Ray Smith. 2007. An overview of the Tesseract OCR engine. InNinth international conference on document analysis and recognition (ICDAR 2007), Vol. 2. IEEE, 629– 633

work page 2007

[49] [50]

Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval.Journal of documentation28, 1 (1972), 11–21

work page 1972

[50] [51]

Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. 2023. Slidevqa: A dataset for document visual question an- swering on multiple images. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 13636–13645

work page 2023

[51] [52]

Nomic Team. 2025. Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval. https://nomic.ai/blog/posts/nomic- embed-multimodal

work page 2025

[52] [53]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[53] [54]

Pengfei Wang, Guohai Xu, Weinong Wang, Junjie Yang, Jie Lou, and Yunhua Xue. 2025. Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis.arXiv preprint arXiv:2505.10541 (2025)

work page arXiv 2025

[54] [55]

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. 2024. Uniir: Training and benchmarking universal multimodal information retrievers. InEuropean Conference on Computer Vision. Springer, 387–404

work page 2024

[55] [56]

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. 2025. FG-CLIP: Fine-Grained Visual and Textual Alignment.arXiv preprint arXiv:2505.05071(2025)

work page arXiv 2025

[56] [57]

Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, and Even Oldridge. 2025. Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model. arXiv:2507.05513 [cs.CV] https://arxiv.org/abs/2507.05513

work page arXiv 2025

[57] [58]

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiao- dan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. Filip: Fine-grained interactive language-image pre-training.arXiv preprint arXiv:2111.07783(2021)

work page arXiv 2021

[58] [59]

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2022. Retrieval- augmented multimodal language modeling.arXiv preprint arXiv:2211.12561 (2022)

work page arXiv 2022

[59] [60]

Bowen Yu, Cheng Fu, Haiyang Yu, Fei Huang, and Yongbin Li. 2023. Unified Language Representation for Question Answering over Text, Tables, and Images. InFindings of the Association for Computational Linguistics: ACL 2023. 4756–4765

work page 2023

[60] [61]

Sergey Zagoruyko and Nikos Komodakis. 2016. Paying more attention to atten- tion: Improving the performance of convolutional neural networks via attention transfer.arXiv preprint arXiv:1612.03928(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[61] [62]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sig- moid loss for language image pre-training. InProceedings of the IEEE/CVF inter- national conference on computer vision. 11975–11986

work page 2023

[62] [63]

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. 2025. Mllms know where to look: Training-free perception of small visual details with multimodal llms.arXiv preprint arXiv:2502.17422(2025)

work page arXiv 2025

[63] [64]

Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Jianbing Shen, Guodong Long, Can Xu, and Daxin Jiang. 2024. Fine-grained distillation for long document retrieval. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19732–19740

work page 2024

[64] [65]

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question an- swering benchmark on a hybrid of tabular and textual content in finance.arXiv preprint arXiv:2105.07624(2021)

work page arXiv 2021

[65] [66]

Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon

work page

[66] [67]

Promptreps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval.arXiv preprint arXiv:2404.18424 (2024)

work page arXiv 2024