
arxiv: 2605.30027 · v1 · pith:B6SFSZYLnew · submitted 2026-05-28 · 💻 cs.CV · cs.IR

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

Pith reviewed 2026-06-29 08:07 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords multimodal document retrieval · layout-aware sparse embeddings · hybrid encoding · few-shot reranker · reasoning-augmented demonstrations · MultiDocR benchmark · OCR-free retrieval

The pith

DocRetriever improves multimodal document retrieval with layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable few-shot reranker.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that multimodal documents with tables, figures, and complex layouts can be retrieved more effectively by using a layout-aware sparse embedding technique alongside dense visual embeddings. This hybrid approach works without the need for optical character recognition. It also demonstrates a reranker that uses reasoning-augmented demonstrations and optimized sampling to achieve better accuracy in few-shot settings across different domains. Current methods fail because dense embeddings are too coarse and lose explicit semantics, while supervised rerankers require domain-specific data and do not generalize. The authors support their claims with a new benchmark called MultiDocR that includes diverse assessment dimensions and comprehensive relevance annotations.
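
The hybrid encoding described above can be pictured as a weighted blend of a dense similarity and a sparse overlap score. The sketch below is illustrative only: the single-vector cosine score, the top-k sparsification, the linear fusion, and the names `hybrid_score`, `top_k_sparse`, and `w_dense` are assumptions made for this example (the weight name echoes the axis label of the paper's Figure 4), not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_sparse(weights, k):
    """Keep the k largest-weight entries of a {term_id: weight} mapping."""
    kept = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(kept)


def sparse_dot(q_sparse, d_sparse):
    """Dot product of two sparse vectors stored as {term_id: weight} dicts."""
    if len(q_sparse) > len(d_sparse):
        q_sparse, d_sparse = d_sparse, q_sparse  # iterate over the smaller dict
    return sum(w * d_sparse.get(t, 0.0) for t, w in q_sparse.items())


def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, w_dense=0.5):
    """Assumed linear fusion: w_dense * dense + (1 - w_dense) * sparse."""
    dense = cosine(q_dense, d_dense)
    sparse = sparse_dot(q_sparse, d_sparse)
    return w_dense * dense + (1.0 - w_dense) * sparse


if __name__ == "__main__":
    doc_terms = top_k_sparse({1: 0.1, 2: 0.9, 3: 0.5}, k=2)
    print(hybrid_score([1.0, 0.0], [1.0, 0.0], {2: 1.0}, doc_terms, w_dense=0.7))
```

Setting `w_dense=1.0` recovers a purely dense ranking, which is one simple way to run the ablation the page asks for under "What would settle it".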

Core claim

DocRetriever is a plug-and-play framework that enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

What carries the argument

layout-aware sparse embedding technique enabling hybrid encoding without OCR, and generalizable reranker using reasoning-augmented demonstrations and optimized sampling

Load-bearing premise

That the layout-aware sparse embeddings can capture structurally salient information effectively without using OCR, and that the proposed reranker will generalize across domains when using reasoning-augmented demonstrations in few-shot settings.

What would settle it

The claim would be undercut if, on the MultiDocR benchmark, standard dense embeddings paired with a supervised reranker matched or exceeded DocRetriever's retrieval accuracy, or if removing the sparse component in an ablation left accuracy unchanged.

Figures

Figures reproduced from arXiv: 2605.30027 by Bo Chen, Jieming Zhu, Li Tang, Menghui Zhu, Minjie Hong, Ruofan Hu, Sashuai Zhou, Shengyang Xu, Tao Jin, Xiaoda Yang, Zhou Zhao.

Figure 1: Model architecture of our Hybrid Encoding (left) and Reranker with ICL (right).
Figure 2: Score Distribution.
Moreover, standard ICL relies exclusively on textual query similarity, often failing to capture the visual cues essential for generalization in document reranking. To address this, DocRetriever employs a dual-alignment strategy that integrates query semantics with document visual similarity. Based on our hybrid embeddings, we retrieve the most similar demonstrations based on a joint m…
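
The excerpt is cut off before it defines the joint metric, so the following is only a hedged sketch of what a dual-alignment demonstration selector could look like. The weighted sum of two cosine similarities, the mixing weight `alpha`, the pool layout (`id`, `query_emb`, `doc_emb`), and the function name `select_demonstrations` are all hypothetical choices for illustration, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_demonstrations(query_emb, doc_emb, pool, n=4, alpha=0.5):
    """Rank stored demonstrations by an assumed joint similarity.

    joint = alpha * sim(query, demo query) + (1 - alpha) * sim(doc, demo doc)
    Returns the ids of the n highest-scoring demonstrations.
    """
    scored = []
    for demo in pool:
        query_sim = cosine(query_emb, demo["query_emb"])
        doc_sim = cosine(doc_emb, demo["doc_emb"])
        scored.append((alpha * query_sim + (1.0 - alpha) * doc_sim, demo["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [demo_id for _, demo_id in scored[:n]]


if __name__ == "__main__":
    pool = [
        {"id": "a", "query_emb": [1.0, 0.0], "doc_emb": [0.0, 1.0]},
        {"id": "b", "query_emb": [1.0, 0.0], "doc_emb": [1.0, 0.0]},
        {"id": "c", "query_emb": [0.0, 1.0], "doc_emb": [1.0, 0.0]},
    ]
    print(select_demonstrations([1.0, 0.0], [0.0, 1.0], pool, n=2))
```

With `alpha=1.0` the selector falls back to the text-only behaviour that the excerpt attributes to standard ICL.
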
Figure 3: Token distribution from different VLMs.
to process vocabulary-scale distributions: (1) employing independently trained mapping layers for dimensionality transformation [8, 33], and (2) utilizing the VLM's input embedding matrix to project the vocabulary distribution space $\mathbb{R}^{|V|}$ back into a latent semantic space $\mathbb{R}^{d}$ [24, 65]. Using ColQwen as a backbone, we conduct comparative experiments between these p…
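
Option (2) in the excerpt amounts to taking a probability-weighted average of the rows of the input embedding matrix. A minimal sketch follows, assuming a softmax over vocabulary logits and a toy embedding matrix with one row per vocabulary entry; the names `softmax` and `project_to_latent` and all numbers are illustrative and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project_to_latent(vocab_probs, embedding_matrix):
    """Map a distribution over |V| tokens to a d-dimensional vector.

    embedding_matrix has |V| rows of length d; the result is the
    probability-weighted sum of its rows.
    """
    if len(vocab_probs) != len(embedding_matrix):
        raise ValueError("need one probability per vocabulary row")
    dim = len(embedding_matrix[0])
    latent = [0.0] * dim
    for prob, row in zip(vocab_probs, embedding_matrix):
        for j in range(dim):
            latent[j] += prob * row[j]
    return latent


if __name__ == "__main__":
    # Toy setting: |V| = 3 tokens, d = 2 latent dimensions.
    toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    probs = softmax([2.0, 2.0, -5.0])
    print(project_to_latent(probs, toy_embeddings))
```

The appeal of this route is that it introduces no additional trained parameters, in contrast to the separately trained mapping layers of option (1).
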
Figure 4: nDCG@10 at different $k$ values vs. $w_{dense}$.
6 Conclusion
In this work, we presented DocRetriever, a plug-and-play framework for multimodal document retrieval with a rigorous evaluation benchmark. Our primary contributions are threefold: First, we proposed a layout-aware hybrid encoding scheme that extracts sparse signals directly from VLM hidden states, significantly boosting the retrieval precision of existi…
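
Figure 4 is reported in nDCG@10. For readers who want to reproduce that kind of curve, here is a minimal sketch of the metric using the common linear-gain, log2-discount form. It assumes the input list holds the graded relevance of every judged document in ranked order, so the ideal ordering is derived from the same list; the paper's exact evaluation script is not shown on this page.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Relevance labels of the retrieved pages, best-ranked first.
    print(ndcg_at_k([0, 1, 0, 1], k=10))
```

Sweeping a fusion weight such as $w_{dense}$ and plotting the mean of `ndcg_at_k` over all queries would give a curve of the same kind as the one the caption describes.
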
Abstract

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DocRetriever, a plug-and-play framework for multimodal document retrieval. It proposes a layout-aware sparse embedding technique to enable effective hybrid encoding without OCR, a generalizable reranker that uses reasoning-augmented demonstrations and optimized sampling for few-shot settings, and a new benchmark MultiDocR with comprehensive relevance annotations. The central claim is that these components yield superior retrieval performance over state-of-the-art dense-embedding-plus-supervised-reranker pipelines across diverse benchmarks.

Significance. If the claims hold with detailed, reproducible evidence, the work would address two practical bottlenecks in document retrieval (coarse dense embeddings that miss layout structure and domain-specific supervised rerankers) while supplying a new evaluation resource. The plug-and-play framing and avoidance of OCR are practically attractive strengths.

major comments (3)
  1. [§3] §3 (Method), layout-aware sparse embedding subsection: the claim that the technique 'captures structurally salient information' (tables, figures, layout) without OCR is load-bearing for the hybrid-encoding superiority argument, yet the manuscript supplies no description of the sparse feature extraction process, the layout encoding mechanism, or any ablation that isolates the layout component from a standard sparse baseline.
  2. [§3.2] §3.2 (Reranker), few-shot generalization paragraph: the assertion that reasoning-augmented demonstrations plus optimized sampling produce cross-domain generalization rests on unshown implementation details (exact sampling procedure, how reasoning is injected into demonstrations, and any cross-domain transfer results). Without these, the comparison to supervised rerankers cannot be assessed.
  3. [§4] §4 (Experiments), main results table: the superiority claim is stated but the provided text contains no quantitative numbers, error bars, statistical tests, or ablation tables that would allow verification that the gains are attributable to the proposed components rather than implementation choices.
minor comments (2)
  1. [Abstract / §2] The abstract and introduction refer to 'diverse benchmarks' and 'MultiDocR' without listing the exact datasets or annotation protocol in the opening sections; moving a concise table of benchmark statistics to §2 would improve readability.
  2. [§3.1] Notation for the sparse embedding (e.g., how layout tokens are represented) is introduced without an explicit equation or diagram; adding a small schematic in §3.1 would clarify the hybrid encoding pipeline.
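
Major comment 3 asks for statistical tests on the reported gains. One common option is a paired bootstrap over per-query metric differences, sketched below. The function name `paired_bootstrap`, the resample count, and the toy scores are assumptions for illustration; nothing here reflects an analysis the authors have run.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap over per-query scores of two systems.

    Returns (mean difference A - B, fraction of resamples in which
    A's mean is not higher than B's).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping the A/B pairing intact.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n <= 0:
            not_better += 1
    return observed, not_better / n_resamples


if __name__ == "__main__":
    # Toy per-query nDCG@10 values for a proposed system and a baseline.
    proposed = [0.62, 0.71, 0.55, 0.80, 0.67]
    baseline = [0.60, 0.65, 0.58, 0.74, 0.66]
    print(paired_bootstrap(proposed, baseline))
```

A small returned fraction suggests the gain is unlikely to be a sampling artefact of the query set; it says nothing about whether the gain comes from the proposed components, which is what the requested ablations address.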

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate clarifications and additional results into the revised manuscript.

  1. Referee: [§3] §3 (Method), layout-aware sparse embedding subsection: the claim that the technique 'captures structurally salient information' (tables, figures, layout) without OCR is load-bearing for the hybrid-encoding superiority argument, yet the manuscript supplies no description of the sparse feature extraction process, the layout encoding mechanism, or any ablation that isolates the layout component from a standard sparse baseline.

    Authors: We agree the description of the sparse feature extraction process and layout encoding mechanism requires expansion. The revised §3 will detail the extraction steps, how layout information is encoded without OCR, and include an ablation isolating the layout component versus a standard sparse baseline. revision: yes

  2. Referee: [§3.2] §3.2 (Reranker), few-shot generalization paragraph: the assertion that reasoning-augmented demonstrations plus optimized sampling produce cross-domain generalization rests on unshown implementation details (exact sampling procedure, how reasoning is injected into demonstrations, and any cross-domain transfer results). Without these, the comparison to supervised rerankers cannot be assessed.

    Authors: We will expand §3.2 with the exact sampling procedure, the method for injecting reasoning into demonstrations, and cross-domain transfer results to substantiate the generalization claims. revision: yes

  3. Referee: [§4] §4 (Experiments), main results table: the superiority claim is stated but the provided text contains no quantitative numbers, error bars, statistical tests, or ablation tables that would allow verification that the gains are attributable to the proposed components rather than implementation choices.

    Authors: The manuscript tables contain quantitative results, but we acknowledge the absence of error bars, statistical tests, and expanded ablations. The revision will add these to allow verification of component contributions. revision: yes

Circularity Check

0 steps flagged

No circularity: new methods and benchmark presented without reduction to inputs or self-citations


The paper proposes DocRetriever as a new plug-and-play framework with layout-aware sparse embeddings for hybrid encoding, a reasoning-augmented reranker for few-shot settings, and a new MultiDocR benchmark. None of the enumerated circularity patterns appear in the abstract or described claims. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or imported uniqueness theorems are referenced. The derivation chain consists of identifying limitations in prior dense+supervised approaches and asserting novel components to address them, which remains self-contained against external benchmarks rather than reducing by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.



Reference graph

Works this paper leans on

79 extracted references · 49 canonical work pages · 18 internal anchors

  1. [1]

    IJsbrand Jan Aalbersberg. 1994. A document retrieval model based on term frequency ranks. In SIGIR’94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, organised by Dublin City University. Springer, 163–172

  2. [2]

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024)

  3. [3]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  5. [5]

    Roelien Bastiaanse, Martijn Wieling, and Nienke Wolthuis. 2016. The role of frequency in the retrieval of nouns and verbs in aphasia. Aphasiology 30, 11 (2016), 1221–1239

  6. [6]

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. 2024. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726 (2024)

  7. [7]

    Antoine Chaffin and Aurélien Lac. 2024. MonoQwen: Visual Document Reranking. https://huggingface.co/lightonai/MonoQwen2-VL-v0.1

  8. [8]

    Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter, Albin Madappally Jose, Alexander Toshev, Jonathon Shlens, Ruoming Pang, and Yinfei Yang. 2023. Stair: Learning sparse text and image representation in grounded tokens. arXiv preprint arXiv:2301.13081 (2023)

  9. [9]

    Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. 2023. Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029 (2023)

  10. [10]

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216 (2024)

  11. [11]

    Xize Cheng, Ruofan Hu, Xiaoda Yang, Jingyu Lu, Dongjie Fu, Shengpeng Ji, Rongjie Huang, Boyang Zhang, Tao Jin, Zhou Zhao, et al. 2025. Voxdialogue: Can spoken dialogue systems understand information beyond words?. In International Conference on Learning Representations, Vol. 2025. 288–303

  12. [12]

    Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, and Lidong Bing. 2024. M-longdoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework. arXiv preprint arXiv:2411.06176 (2024)

  13. [13]

    Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. 2025. M3DocVQA: Multi-modal Multi-page Multi-document Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6178–6188

  14. [14]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  15. [15]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186

  16. [16]

    Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, and Soyeon Caren Han. 2024. MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering. arXiv preprint arXiv:2404.12720 (2024)

  17. [17]

    Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, and Yong Liu. 2025. MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents. arXiv preprint arXiv:2501.08828 (2025)

  18. [18]

    Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, and Yong Liu. 2025. Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering. arXiv preprint arXiv:2505.16470 (2025)

  19. [19]

    Minghui Fang, Shengpeng Ji, Jialong Zuo, Xize Cheng, Wenrui Liu, Xiaoda Yang, Ruofan Hu, Jieming Zhu, and Zhou Zhao. 2025. GTA: Towards generative text-to-audio retrieval via multi-scale tokenizer. In Proc. Interspeech. 2650–2654

  20. [20]

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. Colpali: Efficient document retrieval with vision language models. In The Thirteenth International Conference on Learning Representations

  21. [21]

    Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2288–2292

  22. [22]

    Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Sedigheh Eslami, Scott Martens, Bo Wang, Nan Wang, and Han Xiao. 2025. jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval. arXiv preprint arXiv:2506.18902 (2025)

  24. [24]

    Minjie Hong, Zetong Zhou, Zirun Guo, Ziang Zhang, Ruofan Hu, Weinan Gan, Jieming Zhu, and Zhou Zhao. 2025. Generative Reasoning Recommendation via LLMs. arXiv preprint arXiv:2510.20815 (2025)

  25. [25]

    Oleksii Hrinchuk, Valentin Khrulkov, Leyla Mirvakhabova, Elena Orlova, and Ivan Oseledets. 2019. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787 (2019)

  26. [26]

    Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, and Tao Jin. 2025. Vela: Scalable embeddings with voice large language models for multimodal retrieval. arXiv preprint arXiv:2506.14445 (2025)

  27. [27]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 (2021)

  28. [28]

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2024. Vlm2vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160 (2024)

  29. [29]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In EMNLP (1). 6769–6781

  30. [30]

    Yibin Lei, Di Wu, Tianyi Zhou, Tao Shen, Yu Cao, Chongyang Tao, and Andrew Yates. 2024. Meta-task prompting elicits embeddings from large language models. arXiv preprint arXiv:2402.18458 (2024)

  31. [31]

    Michael E Lesk. 1969. Word-word associations in document retrieval systems. American documentation 20, 1 (1969), 27–38

  32. [32]

    Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia Shao. 2023. Making Large Language Models A Better Foundation For Dense Retrieval. arXiv:2312.15503 [cs.CL]

  33. [33]

    Haiyang Li. 2025. Mrg-bench: Evaluating and exploring the requirements of context for repository-level code generation. arXiv preprint arXiv:2508.02998 (2025)

  34. [34]

    Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, and Tong He. 2024. Unified lexical representation for interpretable visual-language alignment. Advances in Neural Information Processing Systems 37 (2024), 1141–1161

  35. [35]

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281 (2023)

  36. [36]

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. 2024. Mm-embed: Universal multimodal retrieval with multimodal llms. arXiv preprint arXiv:2411.02571 (2024)

  37. [37]

    Qi Liu, Haozhe Duan, Yiqun Chen, Quanfeng Lu, Weiwei Sun, and Jiaxin Mao. 2025. Llm4ranking: An easy-to-use framework of utilizing large language models for document reranking. arXiv preprint arXiv:2504.07439 (2025)

  39. [39]

    Edward Loper and Steven Bird. 2002. Nltk: The natural language toolkit. arXiv preprint cs/0205028 (2002)

  40. [40]

    Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, and Xipeng Qiu. 2025. Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning. arXiv preprint arXiv:2510.26205 (2025)

  41. [43]

    Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. 2024. Unifying Multimodal Retrieval via Document Screenshot Embedding. arXiv:2406.11251 (2024)

  43. [45]

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. 2024. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. arXiv preprint arXiv:2407.01523 (2024)

  44. [46]

    Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv preprint arXiv:2505.17166 (2025)

  45. [47]

    Priyanka Mandikal and Raymond Mooney. 2024. Sparse meets dense: A hybrid approach to enhance scientific document retrieval. arXiv preprint arXiv:2401.04055 (2024)

  46. [48]

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. 2022. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1697–1706

  47. [49]

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2200–2209

  48. [50]

    Thong Nguyen, Mariya Hendriksen, and Andrew Yates. 2024. Multimodal learned sparse retrieval for image suggestion. arXiv preprint arXiv:2402.07736 (2024)

  49. [51]

    Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713 (2020)

  50. [52]

    Joël Plisson, Nada Lavrac, Dunja Mladenic, et al. 2004. A rule based approach to word lemmatization. In Proceedings of IS, Vol. 3. sn, 83–86

  51. [53]

    Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, Vol. 242. New Jersey, USA, 29–48

    DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark. KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea

  52. [54]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019)

  53. [55]

    Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389

  54. [56]

    Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. 2023. Slidevqa: A dataset for document visual question answering on multiple images. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 13636–13645

  55. [57]

    Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Zhengwei Tao, and Shuai Ma. 2024. LLMs are Also Effective Embedding Models: An In-depth Overview. arXiv preprint arXiv:2412.12591 (2024)

  56. [58]

    Raghuveer Thirukovalluru and Bhuwan Dhingra. 2024. Geneol: Harnessing the generative power of llms for training-free sentence embeddings. arXiv preprint arXiv:2410.14635 (2024)

  57. [59]

    Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. arXiv preprint physics/0004057 (2000)

  58. [60]

    Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. 2023. Hierarchical multimodal transformers for multipage docvqa. Pattern Recognition 144 (2023), 109834

  59. [61]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  60. [62]

    Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, et al. 2023. Document understanding dataset and evaluation (dude). In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19528–19540

  61. [63]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022)

  62. [64]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672 (2024)

  63. [65]

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. 2024. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9440–9450

  64. [66]

    Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG type ranking measures. In Conference on learning theory. PMLR, 25–54

  65. [67]

    Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, and Hengxing Cai. 2025. MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval. arXiv preprint arXiv:2506.12364 (2025)

  66. [68]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  67. [69]

    Wenkai Yang, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, and Bin He. 2021. Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in NLP models. arXiv preprint arXiv:2103.15543 (2021)

  68. [70]

    Xiaoda Yang, Xize Cheng, Minghui Fang, Hongshun Qiu, Yuhang Ma, JunYu Lu, Jiaqi Duan, Sihang Cai, Zehan Wang, Ruofan Hu, et al. 2025. Multimodal conditional retrieval with high controllability. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 3577–3585

  69. [71]

    Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. 2024. Visrag: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594 (2024)

  70. [72]

    Shunyu Zhang, Yaobo Liang, Ming Gong, Daxin Jiang, and Nan Duan. 2022. Multi-view document representation learning for open-domain dense retrieval. arXiv preprint arXiv:2203.08372 (2022)

  71. [73]

    Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, et al. 2024. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 1393–1412

  72. [74]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025)

  73. [75]

    Junjie Zhou, Ze Liu, Lei Xiong, Jin-Ge Yao, Yueze Wang, Shitao Xiao, Fenfen Lin, Miguel Hu Chen, Zhicheng Dou, Siqi Bao, et al. 2025. MR2-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval. arXiv preprint arXiv:2509.26378 (2025)

  74. [76]

    Yucheng Zhou, Xiang Li, Qianning Wang, and Jianbing Shen. 2024. Visual in-context learning for large vision-language models. arXiv preprint arXiv:2402.11574 (2024)

  75. [77]

    Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, and Tat-Seng Chua. 2022. Towards complex document understanding by discrete reasoning. In Proceedings of the 30th ACM International Conference on Multimedia. 4857–4866

  76. [78]

    Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. 2024. Promptreps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval. arXiv preprint arXiv:2404.18424 (2024)

  78. [80]

    WANG Zhuohao, WANG Dong, and LI Qing. 2021. Keyword extraction from scientific research projects based on SRP-TF-IDF. Chinese Journal of Electronics 30, 4 (2021), 652–657

  79. [81]

    Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, and Dong Yu. 2024. Docbench: A benchmark for evaluating llm-based document reading systems. arXiv preprint arXiv:2407.10701 (2024)

    A Reinforced ICL Details

    Hyperparameter          Configuration
    Temperature             0.2
    Top-p                   0.95
    Confidence Threshold    > 0.8
    Max Examples (k)        4 (2 positive, 2 negati...